
Add SyntheticDiD variance_method='bootstrap_refit' and coverage MC study#351

Merged
igerber merged 19 commits into main from sdid-bootstrap-refit on Apr 23, 2026

Conversation

@igerber
Owner

@igerber igerber commented Apr 22, 2026

Summary

  • Ship paper-faithful variance_method="bootstrap_refit" for SyntheticDiD — Arkhangelsky et al. (2021) Algorithm 2 step 2. Re-estimates ω̂_b and λ̂_b via two-pass sparsified Frank-Wolfe on each pairs-bootstrap draw, using the fit-time normalized-scale zeta. Opt-in; default remains "placebo".
  • Cross-surface allow-list extensions so the new enum flows through all surfaces that read variance_method: results.py:960 summary line, business_report.py:602 BR inference-label, synthetic_did.py:695 n_bootstrap result population, power.py SDID guidance strings, SyntheticDiD.__init__ docstring, diff_diff/guides/llms-full.txt. Survey + refit raises NotImplementedError upstream in fit() — Rao-Wu rescaled-weight composition is tracked as a follow-up TODO.
  • Coverage Monte Carlo study (benchmarks/python/coverage_sdid.py) runs 500 seeds × B=200 × 3 DGPs × 4 methods under H0 and writes benchmarks/data/sdid_coverage.json. Rejection rates at α ∈ {0.01, 0.05, 0.10} and mean-SE / true-SD ratios are transcribed into REGISTRY.md §SyntheticDiD. Headline: refit achieves near-nominal calibration (rej@0.05 ≈ 0.04–0.08) across all 3 DGPs; fixed-weight over-rejects by ~1.8–3.2× on smaller panels; placebo is also near-nominal; jackknife is slightly anti-conservative on smaller panels (matches Arkhangelsky §6.3's reported 98% / 93% coverage pattern).
  • SyntheticDiD.set_params validation gap: the sklearn-style setter path bypassed constructor validation for the new enum. Extract a shared _validate_config helper; make set_params transactional so a validation failure rolls back touched attributes rather than leaving the instance partially mutated.

Methodology references (required if estimator / math changes)

  • Method name(s): Synthetic Difference-in-Differences bootstrap variance — new variance_method="bootstrap_refit" variant.
  • Paper / source link(s): Arkhangelsky, Athey, Hirshberg, Imbens, Wager (2021), "Synthetic Difference-in-Differences," AER 111(12). New variant implements Algorithm 2 step 2 verbatim (re-estimate ω̂_b and λ̂_b per draw via Frank-Wolfe). Cross-language anchor: Julia Synthdid.jl::src/vcov.jl:96-103 is the only existing refit-bootstrap implementation; R synthdid::vcov(method="bootstrap") and Stata sdid.ado:1033-1037 both use the fixed-weight shortcut. Full methodology surface + requirements checklist row in docs/methodology/REGISTRY.md §SyntheticDiD.
  • Any intentional deviations from the source (and why): Survey designs (including pweight-only) composed with bootstrap_refit raise NotImplementedError — Rao-Wu rescaled weights composed with FW re-estimation need their own derivation (the paper does not address survey designs; R has no survey support). Documented in REGISTRY.md and tracked in TODO.md. FW non-convergence warnings are aggregated into a single summary UserWarning at end-of-loop if the rate exceeds 5% (same threshold as retry-exhaustion), rather than emitting one per bootstrap draw; this preserves the warning signal without 200+ per-fit spam. No deviations from the paper's estimator math, SE formula (sqrt((r-1)/r) × sd(ddof=1) — unchanged), or p-value dispatch (analytical from bootstrap SE — unchanged from PR Fix SyntheticDiD bootstrap p-value dispatch and SE formula #349).
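The unchanged SE formula referenced above is compact enough to pin down in a sketch. This is a minimal stand-alone version with a hypothetical helper name, not the library's actual `_bootstrap_se` signature:

```python
import numpy as np

def bootstrap_se(estimates):
    """sqrt((r-1)/r) * sd(ddof=1) over r valid bootstrap replicates.

    Algebraically this equals the population-style sd(ddof=0); writing it
    via the sample SD mirrors the formula as the registry documents it.
    """
    est = np.asarray(estimates, dtype=float)
    r = est.size
    return np.sqrt((r - 1) / r) * est.std(ddof=1)

# The rescaling and the ddof=0 SD coincide exactly:
draws = np.array([1.0, 2.0, 4.0, 7.0])
assert np.isclose(bootstrap_se(draws), draws.std(ddof=0))
```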

Validation

  • Tests added/updated:
    • tests/test_methodology_sdid.py::TestBootstrapRefitSE — 8 real-fit tests (positive SE, diverges from fixed, tracks placebo on exchangeable DGP, raises on pweight + full-design survey, analytical p-value dispatch, enum validation, summary renders replications)
    • tests/test_methodology_sdid.py::TestPValueSemantics::test_refit_p_value_matches_analytical — mirror of the fixed-weight bootstrap regression guard
    • tests/test_methodology_sdid.py::TestGetSetParams — 4 new set_params validation tests (accept bootstrap_refit, reject invalid enum, reject incoherent n_bootstrap, allow n_bootstrap=1 under jackknife) + 1 transactional-rollback test
    • tests/test_methodology_sdid.py::TestCoverageMCArtifact — schema smoke test on the MC JSON (guarded with pytest.skip if absent per feedback_golden_file_pytest_skip.md)
    • tests/test_business_report.py::TestSyntheticDiDBootstrapRefitInferenceLabel — cross-surface guard that BR emits "bootstrap_refit variance" on alpha-override, not the analytical fallback label
    • PR Fix SyntheticDiD bootstrap p-value dispatch and SE formula #349's 1e-10 R-parity bit-identity test (TestJackknifeSERParity::test_bootstrap_se_matches_r) still passes — fixed-weight path byte-identical to baseline despite the _bootstrap_se signature expansion
  • Backtest / simulation / notebook evidence (if applicable): Coverage MC artifact committed at benchmarks/data/sdid_coverage.json (500 seeds × B=200 × 3 DGPs × 4 methods, generated on M-series Mac + Rust backend in ~40 min). Headline calibration table rendered in REGISTRY.md §SyntheticDiD:
| DGP | method | α=0.01 | α=0.05 | α=0.10 | mean SE / true SD |
|-----|--------|--------|--------|--------|-------------------|
| balanced | placebo | 0.016 | 0.060 | 0.086 | 1.05 |
| balanced | bootstrap | 0.106 | 0.160 | 0.230 | 0.85 |
| balanced | bootstrap_refit | 0.028 | 0.078 | 0.116 | 1.05 |
| balanced | jackknife | 0.066 | 0.112 | 0.154 | 1.08 |
| unbalanced | placebo | 0.006 | 0.032 | 0.070 | 1.08 |
| unbalanced | bootstrap | 0.036 | 0.098 | 0.140 | 0.94 |
| unbalanced | bootstrap_refit | 0.008 | 0.038 | 0.080 | 1.11 |
| unbalanced | jackknife | 0.024 | 0.076 | 0.120 | 0.99 |
| AER §6.3 | placebo | 0.018 | 0.058 | 0.086 | 0.99 |
| AER §6.3 | bootstrap | 0.034 | 0.092 | 0.162 | 0.88 |
| AER §6.3 | bootstrap_refit | 0.010 | 0.040 | 0.078 | 1.05 |
| AER §6.3 | jackknife | 0.030 | 0.080 | 0.150 | 0.90 |
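The table's cells reduce to two per-(DGP, method) statistics: the fraction of H0 p-values below each α, and mean SE over the empirical SD of the point estimates across seeds. A minimal sketch of that transcription step, with hypothetical names (the repo's `coverage_sdid.py` internals are not shown here):

```python
import numpy as np

def calibration_row(p_values, ses, point_estimates, alphas=(0.01, 0.05, 0.10)):
    """One (DGP, method) row of the coverage table under H0.

    p_values, ses, point_estimates are per-seed arrays from the MC loop;
    "true SD" is the empirical SD of the point estimates across seeds.
    """
    p = np.asarray(p_values, dtype=float)
    rejection = {a: float(np.mean(p < a)) for a in alphas}
    se_over_sd = float(np.mean(ses) / np.std(point_estimates, ddof=1))
    return rejection, se_over_sd
```

Under well-calibrated inference the rejection rates sit near their α and the SE/SD ratio near 1.0, which is the pattern the refit and placebo rows show above.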

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

igerber and others added 3 commits April 22, 2026 11:56
Implements Arkhangelsky et al. (2021) Algorithm 2 step 2 as an opt-in
variance method that re-estimates ω̂_b and λ̂_b via two-pass sparsified
Frank-Wolfe on each pairs-bootstrap draw, using the fit-time normalized-
scale zeta. Default remains "placebo".
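The loop this commit describes can be sketched as follows. The three solver callables are hypothetical stand-ins for the library's Frank-Wolfe helpers, injected so the skeleton stays self-contained; the real implementation lives in `_bootstrap_se`:

```python
import numpy as np

def bootstrap_refit_se(Y, n_treated, n_bootstrap, zeta, rng,
                       solve_unit_weights, solve_time_weights, sdid_estimate):
    """Pairs (unit-level) bootstrap re-estimating both weight vectors per
    draw — the Algorithm 2 step 2 shape described in the commit message.

    Y is units x periods with treated units in the last n_treated rows;
    zeta is the fit-time normalized regularization scale, reused per draw.
    """
    n = Y.shape[0]
    draws = []
    while len(draws) < n_bootstrap:
        idx = rng.integers(0, n, size=n)            # resample units w/ replacement
        n_treated_drawn = int(np.sum(idx >= n - n_treated))
        if not 0 < n_treated_drawn < n:
            continue                                 # need both groups present
        Yb = Y[idx]
        omega_b = solve_unit_weights(Yb, zeta)       # re-run Frank-Wolfe per draw
        lambda_b = solve_time_weights(Yb)
        draws.append(sdid_estimate(Yb, omega_b, lambda_b))
    r = len(draws)
    return np.sqrt((r - 1) / r) * np.std(draws, ddof=1)
```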

Cross-surface allow-list extensions land in one PR per
feedback_cross_surface_parity_audit.md:
- SyntheticDiD.fit() dispatcher and _bootstrap_se signature
- synthetic_did.py:695 n_bootstrap result population
- results.py:960 summary() "Bootstrap replications" gating
- business_report.py:602 inference-label allow-list
- power.py SDID guidance strings (2 sites)
- SyntheticDiD.__init__ docstring and diff_diff/guides/llms-full.txt

Survey + bootstrap_refit raises NotImplementedError upstream in fit()
(covers both pweight-only and full-design) — the Rao-Wu rescaled-weight
composition is tracked as a follow-up TODO.

Coverage MC study (benchmarks/python/coverage_sdid.py) runs 500 seeds ×
B=200 × 3 DGPs × 4 methods under H0 and writes
benchmarks/data/sdid_coverage.json (4.4 KB). Rejection rates at α ∈
{0.01, 0.05, 0.10} and mean SE / true SD ratios are transcribed into
REGISTRY.md §SyntheticDiD. Headline: refit achieves near-nominal
calibration across all 3 DGPs; fixed-weight over-rejects by roughly
1.8–3.2× on smaller panels, consistent with the SE under-estimate from
ignoring weight-estimation uncertainty.

Tests: TestBootstrapRefitSE (8 tests) + test_refit_p_value_matches_analytical
in TestPValueSemantics + TestCoverageMCArtifact schema smoke test
(guarded with pytest.skip per feedback_golden_file_pytest_skip.md) +
cross-surface BR inference-label test. PR #349's 1e-10 R-parity
bit-identity gate still passes.

Per-draw Frank-Wolfe non-convergence UserWarnings are suppressed inside
the refit loop and aggregated into a single summary warning at end-of-
loop if the rate exceeds 5% — the same threshold the retry-exhaustion
guard uses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AI review caught that the sklearn-style setter path bypassed the
constructor's enum/coherence checks, so users could
``set_params(variance_method='not_a_method')`` after construction and
slip past the __init__ validation added for ``bootstrap_refit``.

Extract the existing checks into a private ``_validate_config()`` helper
and call from both ``__init__`` and ``set_params`` so both paths enforce
the same contract. Constant-hoist the valid-methods tuple onto the class
as ``_VALID_VARIANCE_METHODS`` so __init__ and the validator share a
single source.
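The shared-validator shape described above, as an illustrative sketch (names mirror the commit; the real class has many more parameters, and the transactional rollback added in a later commit is omitted here):

```python
class SyntheticDiDSketch:
    """Illustrative only — shows the shared _validate_config contract,
    not the library's actual class."""

    _VALID_VARIANCE_METHODS = ("placebo", "bootstrap", "jackknife")

    def __init__(self, variance_method="placebo", n_bootstrap=200):
        self.variance_method = variance_method
        self.n_bootstrap = n_bootstrap
        self._validate_config()

    def _validate_config(self):
        if self.variance_method not in self._VALID_VARIANCE_METHODS:
            raise ValueError(f"unknown variance_method: {self.variance_method!r}")
        if self.variance_method != "jackknife" and self.n_bootstrap < 2:
            raise ValueError("n_bootstrap must be >= 2 unless jackknife")

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        self._validate_config()   # same contract as __init__
        return self
```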

Add regression tests under ``TestGetSetParams``:
- set_params accepts ``bootstrap_refit``
- set_params rejects unknown variance_method (parity with __init__)
- set_params rejects incoherent n_bootstrap < 2 when method != jackknife
- set_params allows n_bootstrap=1 under jackknife (deterministic)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to the review's P2 finding: if ``_validate_config`` rejects the
post-update state in a multi-attribute ``set_params`` call, the instance
was left with partially-applied (invalid) values after the raised
``ValueError``. Snapshot originals before any setattr and restore them
in an except handler so the raise leaves the object consistent with its
pre-call configuration.

Regression test asserts post-raise state matches the pre-call state.
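The snapshot-and-restore pattern this commit describes can be sketched as a standalone helper (hypothetical name and free-function form; the real logic lives inside the estimator's `set_params`):

```python
import copy

def transactional_set_params(obj, validate, **params):
    """Apply params, validate, and roll back every touched attribute if
    validation raises — so the raise leaves obj in its pre-call state."""
    snapshot = {name: copy.deepcopy(getattr(obj, name)) for name in params}
    try:
        for name, value in params.items():
            setattr(obj, name, value)
        validate(obj)
    except Exception:
        for name, value in snapshot.items():
            setattr(obj, name, value)   # restore pre-call configuration
        raise
    return obj
```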

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Overall Assessment

✅ Looks good

Executive Summary

  • No unmitigated P0/P1 issues found. The new bootstrap_refit path re-estimates both ω and λ per draw and keeps the documented SE / p-value semantics aligned with the Methodology Registry.
  • The bootstrap_refit + survey-design exclusion is explicitly documented in the registry and tracked in TODO.md, so it is informational rather than blocking.
  • P3: the aggregated Frank-Wolfe non-convergence warning is counted on raw solver warnings rather than draw-level failures, so it can fire earlier than the registry text implies.
  • P3: a few in-code docstrings still describe the pre-PR variance-method surface.
  • P3: the existing scale-equivariance regression suite was not extended to bootstrap_refit, so the new normalization-sensitive path lacks a direct guard.

Methodology

No unmitigated findings. The refit path in diff_diff/synthetic_did.py:L583 and diff_diff/synthetic_did.py:L1071 re-estimates both weights per draw, and the non-placebo inference dispatch in diff_diff/synthetic_did.py:L642 matches the registry notes in docs/methodology/REGISTRY.md:L1505 and docs/methodology/REGISTRY.md:L1555.

  • Severity P3. Impact: bootstrap_refit still rejects any survey design, but that deviation is explicitly documented in docs/methodology/REGISTRY.md:L1513 and tracked in TODO.md:L103, so it is mitigated and does not block approval. Concrete fix: none for this PR; the follow-up Rao-Wu + refit derivation is already recorded.

Code Quality

  • Severity P3. Impact: the refit non-convergence summary warning counts raw solver warnings in diff_diff/synthetic_did.py:L1092 and compares them to successful draws in diff_diff/synthetic_did.py:L1173, while the registry text says the warning should trigger when the non-convergence rate exceeds 5% of draws in docs/methodology/REGISTRY.md:L1513. That can emit the warning below the documented draw-level threshold when both omega and lambda warn on the same draw. Concrete fix: count “any non-convergence on this draw” as a boolean, or normalize against 2 * n_successful if the intended unit is solver calls.
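The review's concrete fix — count "any non-convergence on this draw" as a boolean — can be sketched as follows. `solve_draw` is a hypothetical callable that may emit up to three solver UserWarnings per draw; the real loop lives in `_bootstrap_se`:

```python
import warnings

def run_refit_draws(draws, solve_draw, warn_rate_threshold=0.05):
    """Aggregate Frank-Wolfe non-convergence per draw, not per solver call."""
    flagged = 0
    results = []
    for draw in draws:
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            results.append(solve_draw(draw))
        # Boolean per draw: one warning or three both count exactly once,
        # so the rate matches the registry's per-draw convention.
        if any(issubclass(w.category, UserWarning) for w in caught):
            flagged += 1
    if draws and flagged / len(draws) > warn_rate_threshold:
        warnings.warn(
            f"Frank-Wolfe failed to converge on {flagged}/{len(draws)} draws",
            UserWarning,
        )
    return results
```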

Performance

No findings. The extra cost of bootstrap_refit is opt-in and documented in diff_diff/synthetic_did.py:L56 and docs/methodology/REGISTRY.md:L1513.

Maintainability

No unmitigated findings.

Tech Debt

  • Severity P3. Impact: placebo_effects now carries placebo draws, fixed-bootstrap draws, refit-bootstrap draws, and jackknife LOO values, so the name is increasingly misleading after this PR. This is already tracked in TODO.md:L129. Concrete fix: none required for approval; follow the tracked rename to a neutral name such as variance_effects.

Security

No findings.

Documentation/Tests

Execution note: I could not run the test suite in this sandbox because pytest and numpy are not installed here.

…ression

Three P3 items from the CI AI review, all under Documentation/Tests and
Code Quality (no methodology change):

1. **FW warning counting**: the refit loop counted raw solver warnings
   (up to 3 per draw: ω pre-sparsify, ω main, λ main), not draws. REGISTRY
   text describes the rate per valid bootstrap draw, so the two could
   diverge by up to 3×. Change counts "any non-convergence on this draw"
   as boolean 1 and tighten the REGISTRY wording to call out the per-draw
   convention explicitly (each draw runs Frank-Wolfe once for ω and once
   for λ; either can trip the draw).

2. **Docstring gaps**: `fit()` now documents the new survey + refit
   NotImplementedError under the `survey_design` parameter and in the
   Raises block. `SyntheticDiDResults.variance_method` and
   `SyntheticDiDResults.placebo_effects` fields now list `bootstrap_refit`
   and describe the four-way semantics of `placebo_effects` under each
   variance method.

3. **Scale-equivariance regression**: add `"bootstrap_refit"` to the
   `variance_method` parametrizations in `TestScaleEquivariance`
   (baseline-parity, scale-equivariance across ~15 orders of magnitude,
   and extreme-scale-detection). Capture a bit-identity baseline for the
   refit path on `_make_panel(seed=42)`, so the normalization-sensitive
   branch has its own drift guard, not inferred from sibling methods.

All 37 SDID targeted tests + 361 broader sweep tests pass; fixed-weight
1e-10 R-parity bit-identity gate preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 6a033267682aa63852637ca5206c4eab06808c80


Overall Assessment

⚠️ Needs changes

Executive Summary

  • [Newly identified] The PR now codifies the mapping “bootstrap = R-compatible fixed-weight shortcut, bootstrap_refit = paper-faithful refit” across the registry, public docstrings, changelog, and new regression tests, but the official synthdid source currently accessible does not support that blanket “matching R” claim. It stores update.omega / update.lambda in attr(estimate, "opts"), passes those opts back through bootstrap_sample(), and notes the supplied bootstrap weights are used only for initialization. (raw.githubusercontent.com)
  • Previous non-blocking findings from the last review look addressed: the BusinessReport allow-list now includes bootstrap_refit, SyntheticDiDResults.summary() shows bootstrap replications for refit, and the scale-equivariance suite now covers the new variance path.
  • The bootstrap_refit survey-design exclusion is documented in the methodology registry and tracked in TODO.md, so it is informational rather than blocking.
  • I could not execute the test suite in this sandbox because pytest, numpy, and pandas are not installed.

Methodology

  • Severity P1 [Newly identified]. Impact: The changed method description in docs/methodology/REGISTRY.md:L1497-L1513, diff_diff/synthetic_did.py:L54-L59, diff_diff/synthetic_did.py:L912-L936, diff_diff/results.py:L813-L819, and CHANGELOG.md:L11-L12 says the existing variance_method="bootstrap" is the R-compatible fixed-weight bootstrap and positions bootstrap_refit as the new paper-faithful variant. The official synthdid source currently accessible contradicts that as written: synthdid_estimate() persists update.omega / update.lambda in attr(estimate, "opts"), bootstrap_sample() feeds those opts back into the estimator, and the package source says the supplied weights are used only for initialization in bootstrap/placebo SEs. That makes the current fixed-weight path an undocumented deviation from the cited reference implementation, and the new fixed-vs-refit regression in tests/test_methodology_sdid.py:L568-L596 now hard-codes that mismatch into the test surface. Concrete fix: either (1) document the current bootstrap path in REGISTRY.md as a **Note (deviation from R):** and remove the “matching R” / “R-compatible” language everywhere, or (2) retarget variance_method="bootstrap" to the official synthdid bootstrap semantics and rename the fixed-weight shortcut; in both cases, update the new regression/parity language to match. (raw.githubusercontent.com)
  • Severity P3. Impact: bootstrap_refit still rejects any survey design, but that limitation is explicitly documented in docs/methodology/REGISTRY.md:L1513 and tracked in TODO.md:L103. Concrete fix: none required for approval.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity P3. Impact: placebo_effects now spans placebo draws, fixed-bootstrap draws, refit-bootstrap draws, and jackknife LOO values; the name is misleading but already tracked in TODO.md:L129. Concrete fix: none required for approval.

Security

No findings.

Documentation/Tests

No separate unmitigated findings beyond the methodology-labeling issue above. The prior doc/test gaps from the last review appear addressed in the changed files.

Path to Approval

  1. Resolve the public method mapping for SDID bootstrap variance: either document the current fixed-weight bootstrap path as a REGISTRY.md deviation from official synthdid, or change bootstrap to match the official source and rename the fixed-weight shortcut.
  2. Remove or rewrite the new “matching R” / “R-compatible” statements in docs/methodology/REGISTRY.md:L1497-L1513, diff_diff/synthetic_did.py:L54-L59, diff_diff/synthetic_did.py:L912-L936, diff_diff/results.py:L813-L819, and CHANGELOG.md:L11-L12.
  3. Update the new fixed-vs-refit regression and any parity messaging so the tests enforce the approved semantics instead of encoding the current unsupported “R-compatible fixed-weight bootstrap” claim.

… deviation

Tracing R's source (vcov.R::bootstrap_sample and synthdid.R) shows that
R's default synthdid::vcov(method="bootstrap") rebinds
attr(estimate, "opts") — which includes update.omega=TRUE from the
original fit — back into synthdid_estimate inside its do.call, so the
renormalized ω is used only as Frank-Wolfe initialization and ω and λ
are re-estimated per draw. R's default bootstrap is refit, not fixed-
weight. The sum_normalize helper in R's source explicitly comments
that the supplied weights "are used only for initialization" in
bootstrap and placebo SEs.

Our variance_method="bootstrap" holds the renormalized ω exactly (no
FW re-run). It is therefore a deliberate deviation from R's default.
Our PR #349 fixture generator at benchmarks/R/... is a manual
fixed-weight invocation — it omits the opts rebind, which defaults
update.omega to FALSE given non-null weights. The 1e-10 parity test
anchors our fixed-weight path to that manual R invocation, not to R's
real vcov behavior.

Documentation-only fix across all claim sites; no methodology or code
behavior changes:

- REGISTRY.md §SyntheticDiD: label the fixed-weight bootstrap as
  "Alternative: Bootstrap at unit level — fixed-weight shortcut";
  add explicit **Note (deviation from R)** citing the vcov.R /
  synthdid.R opts-rebind mechanism; call out bootstrap_refit as
  matching R's default vcov. Requirements checklist entries and
  R-parity test scope Note rewritten to match.
- diff_diff/synthetic_did.py: __init__ docstring and _bootstrap_se
  method docstring drop the "matching R" framing on the fixed-weight
  path; bootstrap_refit is flagged as matching R's default.
- diff_diff/results.py: SyntheticDiDResults.variance_method field doc
  fixed (I introduced the "R-compatible fixed-weight shortcut"
  misphrasing in round 1; it was wrong).
- CHANGELOG.md Unreleased/Added: Bundle A entry clarifies that
  bootstrap_refit matches R's default and the existing fixed-weight
  bootstrap is now explicitly documented as a deviation.
- benchmarks/R/generate_sdid_bootstrap_parity_fixture.R: loop comment
  calls out the non-default invocation shape (no opts rebind → runs
  fixed-weight); references the Python test that consumes this fixture.
- tests/test_methodology_sdid.py::test_bootstrap_se_matches_r
  docstring: rewritten to scope the parity check correctly (manual
  R fixed-weight, not R's default vcov).
- TODO.md: add a new row for the refit cross-language parity anchor
  (Julia Synthdid.jl or R via the real vcov path) to make the missing
  anchor explicit.

All 57 targeted tests pass; no methodology change, no numerical
output change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 3fa6d8d91ca8a4472d953b775af5f280a8e225d8


Overall Assessment

✅ Looks good

Executive Summary

  • The previous P1 around SDID bootstrap/R semantics looks materially resolved: fixed variance_method="bootstrap" is now documented as a deliberate deviation from R default, and the paper-faithful refit path is split out cleanly.
  • I did not find a new P0/P1 in the changed estimator path. The refit branch re-estimates both weight sets per draw, unsupported survey composition is rejected up front, and inference still routes through the existing NaN-safe path.
  • The new deferred items are properly tracked in TODO.md, so they are informational rather than blocking.
  • Residual issues are P3-only: a small amount of wording still overstates exact R-default parity or uses the old “R-compatible shortcut” framing.
  • I could not execute pytest here because the sandbox does not have it installed.

Methodology
Previous P1 about fixed bootstrap being mislabeled as R-default-faithful appears addressed.

  • Severity P3. Impact: the public wording still slightly over-claims exact R-default parity for bootstrap_refit. The docs say or imply that the new path matches R’s default bootstrap, but the implementation re-runs the library’s standard fresh-start two-pass Frank-Wolfe helpers rather than seeding unit-weight optimization with the renormalized original ω described in the registry’s own R-trace note. That is a numerical implementation choice, not a paper-methodology defect, so this is informational only, especially because the missing direct refit parity anchor is already tracked. Concrete fix: either weaken the wording to “same refit methodology as R default” or thread the R-style initialization into the refit branch and close the tracked parity follow-up. Refs: docs/methodology/REGISTRY.md:L1505, diff_diff/synthetic_did.py:L61, diff_diff/synthetic_did.py:L1098, diff_diff/utils.py:L1573, TODO.md:L104

Code Quality
No findings.

Performance
No findings.

Maintainability

No findings.
Tech Debt

  • Severity P3. Impact: the two real follow-ups introduced by this PR are properly documented and tracked, so they do not block approval: survey-design composition for refit and a direct cross-language refit parity anchor. Concrete fix: none required in this PR; keep the follow-ups open. Refs: TODO.md:L103, TODO.md:L104

Security
No findings.

Documentation/Tests
No unmitigated findings. Verification note: I could not run the new tests in this environment because pytest is not installed.

igerber and others added 3 commits April 22, 2026 16:07
variance_method="bootstrap" now means refit (Arkhangelsky et al. 2021
Algorithm 2 step 2; also R's default synthdid::vcov(method="bootstrap")
behavior, which rebinds attr(estimate, "opts") with update.omega=TRUE so
the renormalized ω serves only as Frank-Wolfe initialization). The
previously-shipped fixed-weight shortcut is removed entirely; the
"bootstrap_refit" enum value briefly added in earlier commits of this
PR is folded back into "bootstrap".

Why this is a correctness fix, not just a relabel: the old fixed-weight
"bootstrap" matched neither the paper (which prescribes refit) nor R's
default vcov (also refit). The 1e-10 R-parity test from PR #349 anchored
fixed-weight Python against a manual R invocation that omitted the opts
rebind — both sides were wrong in the same direction. Coverage MC at
benchmarks/data/sdid_coverage.json (500 seeds × B=200) confirms the
new "bootstrap" tracks placebo near-nominal across the three
representative DGPs; the old fixed-weight column over-rejected at α=0.05
at rates 0.16 / 0.098 / 0.092 (1.8-3.2× nominal).

Capability regression: SDID + survey designs (pweight-only AND
strata/PSU/FPC) now raises NotImplementedError. The removed fixed-weight
bootstrap was the only SDID variance method that supported
strata/PSU/FPC (via the Rao-Wu rescaled bootstrap branch inside
_bootstrap_se). Pweight-only users can switch to variance_method="placebo"
or "jackknife"; strata/PSU/FPC users have no SDID variance option on
this release. Rao-Wu rescaled weights composed with paper-faithful
Frank-Wolfe re-estimation needs a weighted-FW derivation; sketch and
reusable scaffolding pointers live in REGISTRY.md §SyntheticDiD's
"Note (deferred survey + bootstrap composition)" and TODO.md. The
deleted Rao-Wu code (≈48 lines of _bootstrap_se) is recoverable via
`git show <THIS_COMMIT>^:diff_diff/synthetic_did.py` near the pre-rewrite
_bootstrap_se body.
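The support matrix this commit describes (pweight-only works with placebo/jackknife; refit bootstrap rejects all survey designs; strata/PSU/FPC has no SDID variance option) can be sketched as an up-front guard. Hypothetical standalone helper — the real checks live inside `fit()`:

```python
def check_survey_support(variance_method, survey_design):
    """Raise early for unsupported survey + variance-method combinations,
    mirroring the capability matrix described in the commit message."""
    if survey_design is None:
        return
    if variance_method == "bootstrap":
        raise NotImplementedError(
            "variance_method='bootstrap' (refit) does not support survey "
            "designs; Rao-Wu rescaled weights composed with Frank-Wolfe "
            "re-estimation is tracked as a follow-up."
        )
    has_structure = any(
        getattr(survey_design, attr, None) is not None
        for attr in ("strata", "psu", "fpc")
    )
    if has_structure:
        raise NotImplementedError(
            "strata/PSU/FPC designs have no SDID variance option in this "
            "release; pweight-only works with placebo or jackknife."
        )
```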

Cross-surface allow-list reverts: the additive "bootstrap_refit" enum
shipped in earlier commits of this PR rippled through results.py:960
summary gating, business_report.py:602 inference-label allow-list,
power.py SDID guidance strings, llms-full.txt enums, and SyntheticDiDResults
field docstrings. All of those are now back to a 3-value surface
("bootstrap", "jackknife", "placebo").

Tests:
- TestBootstrapRefitSE class deleted; 4 unique tests folded into
  TestBootstrapSE (tracks-placebo-exchangeable, raises-pweight-survey,
  raises-full-design-survey, summary-shows-replications).
- test_bootstrap_se_matches_r deleted along with its fixture
  (tests/data/sdid_bootstrap_indices_r.json) and generator
  (benchmarks/R/generate_sdid_bootstrap_parity_fixture.R) — they
  anchored the now-removed fixed-weight path.
- TestPValueSemantics::test_refit_p_value_matches_analytical deleted as
  duplicate of test_bootstrap_p_value_matches_analytical.
- TestScaleEquivariance._BASELINE: "bootstrap" row updated to the
  refit values (4.6033, 0.21424970..., 2.10890881e-102, 200) — bit-
  identical to the captured "bootstrap_refit" baseline since the new
  bootstrap path is the same code as the old refit path. Tolerance
  tightened from rel=1e-8 to rel=1e-14 to enforce bit-identity.
- TestGetSetParams: variance_method literals rebound to "bootstrap";
  test_set_params_accepts_bootstrap_refit deleted (redundant with
  constructor tests).
- TestCoverageMCArtifact: expected methods list set exact-equal to
  ("placebo", "bootstrap", "jackknife").
- test_business_report.py inference-label test class + method renamed
  to drop "refit" suffix; assertion checks for "bootstrap variance".

The benchmarks/data/sdid_coverage.json artifact is updated transitionally
in this commit (fixed-weight column dropped; refit column renamed to
bootstrap) so the schema test stays green; a follow-up commit
regenerates from a fresh 500-seed MC re-run with the new code path.
The REGISTRY coverage table cells are TBD pending that re-run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Doc-only follow-up to the previous commit's bootstrap rewrite. Updates
every user-facing surface that referenced the (now-removed) fixed-weight
bootstrap or the additive bootstrap_refit option:

- docs/choosing_estimator.rst: drops the "Via bootstrap" cell from the
  SDID survey-support row (no SDID variance method supports
  strata/PSU/FPC anymore); rewrites the misdirecting note steering users
  to bootstrap for full survey designs; updates the inference summary
  table description for SDID's variance methods.
- docs/survey-roadmap.md: rewrites the SDID limitations table rows to
  reflect the regression matrix (pweight-only works with placebo /
  jackknife; strata/PSU/FPC has no SDID variance option in this release;
  bootstrap rejects all survey designs).
- docs/performance-scenarios.md: updates the SE-comparison scenario's
  timing expectation note (bootstrap is now ~10-100x slower per fit
  than the previous fixed-weight shortcut).
- docs/tutorials/03_synthetic_did.ipynb: rewrites markdown cells 19
  (inference methods description) and 29 (summary) — bootstrap is now
  paper-faithful refit matching R's default vcov, not the prior
  fixed-weight shortcut.
- docs/tutorials/18_geo_experiments.ipynb: rewrites the bootstrap-vs-
  placebo description (cell t18-cell-028); softens the stakeholder
  narrative claim "the two methods agree" to acknowledge that on small
  panels with non-exchangeable factor structure the SE magnitudes can
  differ while both methods still agree on significance and CI direction
  (cell t18-cell-033); re-executes the comparison cell so the output
  reflects the new bootstrap SE = 4.50 (was 4.26 under fixed-weight).
  The drift-guard asserts at cell t18-cell-026 only pin ATT / conf_int /
  pre-fit RMSE — none of which change — so no guard updates needed.
- diff_diff/synthetic_did.py: fit() docstring's survey_design parameter
  description is now consistent with the actual guards (no bootstrap_refit
  references; explicit pweight-only-on-placebo-or-jackknife matrix).
- benchmarks/python/coverage_sdid.py: --help text drops the
  bootstrap_refit mention.
- METHODOLOGY_REVIEW.md: the v3.x SyntheticDiD review entry's claim that
  bootstrap matches R's bootstrap_sample is replaced with an honest
  description of the corrected refit semantics, plus a parenthetical
  historical note about the prior fixed-weight shortcut.

All 351 targeted tests pass; no methodology or numerical change in this
commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fresh 500 seeds × B=200 run of benchmarks/python/coverage_sdid.py with
the new 3-method ALL_METHODS = ("placebo", "bootstrap", "jackknife").
Total wall-clock ~40 min on M-series Mac, Rust backend.

Numbers match the transitional rename in commit 1 exactly (rej@0.05:
balanced=0.078, unbalanced=0.038, aer63=0.040) — expected bit-identity
since the new "bootstrap" path is the same refit code as the previous
"bootstrap_refit" path, and the MC uses identical seeds (range(500)).
Confirms the rewrite didn't introduce numerical drift.

Headline across the three DGPs:
- bootstrap (refit): near-nominal at α=0.05 (0.078 / 0.038 / 0.040)
  versus nominal 0.05; well within 2σ MC band at 500 seeds.
- placebo: also near-nominal (0.060 / 0.032 / 0.058).
- jackknife: mildly anti-conservative on the smaller panels (0.112 on
  balanced, 0.080 on AER §6.3), matching the paper's §6.3 pattern.

REGISTRY.md §SyntheticDiD coverage-MC table cells replace the _TBD_
placeholders with the transcribed numbers. The narrative under the
table (`Reading:` paragraph) was already rewritten in commit 1 for the
3-method surface; no further edit needed.

The transitional commit-1 artifact (where I renamed bootstrap_refit →
bootstrap in-place without running the MC) carried a `transitional_note`
field in the metadata; the fresh run naturally drops that field since
it's written from scratch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 15481d9a859f1da8fa6f935d3ea5b3d6c8edbca3


Overall Assessment

✅ Looks good

Executive Summary

  • The previous substantive methodology concern is resolved: variance_method="bootstrap" now re-estimates both ω and λ per draw, keeps the documented retry-to-B contract, and still routes inference through safe_inference().
  • I did not find a new unmitigated P0/P1 in the changed estimator path.
  • The survey-support regression is explicit in code and documentation and is properly tracked in TODO.md, so it is informational rather than blocking.
  • Remaining issues are P3-only: one stale replicate-weight fallback message, and wording that still overstates verified parity with R’s default bootstrap.
  • I could not run pytest here because pytest is not installed in this environment.

Methodology

  • Severity P3. Impact: the refit bootstrap implementation fresh-starts each Frank-Wolfe solve via compute_sdid_unit_weights(...) / compute_time_weights(...), but the changed docs still state that the path matches R’s default vcov(method="bootstrap"), whose own rationale in the registry depends on the renormalized ω being threaded in as initialization. That is an implementation-choice / verification issue, not a demonstrated numerical defect, and it is already tracked in TODO.md, so it should not block approval. Concrete fix: either soften the wording to “paper-faithful refit bootstrap, intended to match R default” or add the TODO-tracked cross-language parity harness. Refs: diff_diff/synthetic_did.py:L924-L935, docs/methodology/REGISTRY.md:L1506-L1508, METHODOLOGY_REVIEW.md:L504-L508, CHANGELOG.md:L15-L18, TODO.md:L104-L104
  • No other findings. On the changed estimator path, the code now matches the updated registry on the load-bearing methodology points: it re-estimates both weight sets per bootstrap draw, retries degenerate/non-finite draws until n_bootstrap valid replicates or the bounded budget, preserves the sqrt((r-1)/r) * sd(ddof=1) SE formula, and routes bootstrap/jackknife inference through safe_inference() rather than inline t/p/CI code. Refs: diff_diff/synthetic_did.py:L583-L650, diff_diff/synthetic_did.py:L823-L1024, docs/methodology/REGISTRY.md:L1497-L1552
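
The SE aggregation named in that finding can be sketched as (illustrative helper name, not the library's internal function):

```python
import numpy as np

def bootstrap_se(att_replicates: np.ndarray) -> float:
    """SE of r bootstrap ATT replicates via sqrt((r - 1) / r) * sd(ddof=1).
    Algebraically this equals the population sd (ddof=0): the correction
    factor cancels the Bessel adjustment inside the sample sd."""
    r = att_replicates.size
    return float(np.sqrt((r - 1) / r) * np.std(att_replicates, ddof=1))
```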

Code Quality

  • No findings.

Performance

  • No findings. The documented slowdown is the expected cost of correcting the methodology from fixed-weight bootstrap to per-draw refitting, not an accidental regression. Refs: CHANGELOG.md:L15-L21

Maintainability

  • Severity P3. Impact: the replicate-weight guard still tells users to “Use a TSL-based survey design (strata/psu/fpc),” but this PR simultaneously removes SDID support for strata/PSU/FPC on all variance methods. Users hitting that path are now sent to an unsupported alternative. Concrete fix: update the exception text to reflect the new contract, e.g. “replicate weights are not supported; pweight-only works with placebo / jackknife, and full survey designs have no SDID variance path in this release.” Refs: diff_diff/synthetic_did.py:L286-L290, diff_diff/synthetic_did.py:L299-L330, docs/methodology/REGISTRY.md:L1549-L1550

Tech Debt

  • Severity P3. Impact: the two real follow-ups introduced by this change are properly tracked, so they do not block approval: SDID+survey bootstrap composition and a direct cross-language refit parity anchor. Concrete fix: none required in this PR; keep the TODOs open. Refs: TODO.md:L103-L104

Security

  • No findings.

Documentation/Tests

CI review on commit 15481d9 flagged the docs as overclaiming parity with
R's default synthdid::vcov(method="bootstrap"): R warm-starts Frank-Wolfe
from the renormalized fit-time ω per draw (and keeps fit-time λ as FW
init for the λ re-estimation), while our Python port was cold-starting
from uniform. On the strictly-convex FW objective with simplex constraint,
warm- and cold-start converge to the same global minimum given enough
iterations — but the 100-iter pre-sparsify pass may not fully converge
on some draws, and then sparsification is path-dependent on the init.

Port the warm-start shape:

- diff_diff/utils.py: compute_sdid_unit_weights and compute_time_weights
  gain an init_weights=None kwarg, forwarded to _sc_weight_fw for the
  first pass. When None (default), preserves the Rust top-level fast-path
  unchanged. When provided, falls through to the Python two-pass
  dispatcher; inner FW calls still dispatch to Rust via _sc_weight_fw, so
  the perf cost is one Python call per pass per draw.
- diff_diff/synthetic_did.py::_bootstrap_se: thread warm-start per draw.
  boot_omega_init = _sum_normalize(unit_weights[boot_control_idx]) (same
  shape as R's sum_normalize(weights$omega[sort(ind[ind <= N0])])).
  boot_lambda_init = time_weights (fit-time λ unchanged, matching R's
  weights.boot$lambda = weights$lambda).
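
A minimal sketch of the per-draw warm-start threading described above (the `_sum_normalize` body here is an assumption written to mirror R's `sum_normalize`; the real helper may differ):

```python
import numpy as np

def _sum_normalize(w: np.ndarray) -> np.ndarray:
    # Renormalize nonnegative weights to sum to 1; fall back to uniform
    # when everything is zero (assumed behavior, mirroring R).
    s = w.sum()
    return w / s if s > 0 else np.full(w.shape, 1.0 / w.size)

def warm_starts(unit_weights, time_weights, boot_control_idx):
    """Per-draw Frank-Wolfe inits: renormalized fit-time omega restricted
    to the resampled controls, and the fit-time lambda unchanged."""
    boot_omega_init = _sum_normalize(np.asarray(unit_weights)[boot_control_idx])
    boot_lambda_init = np.asarray(time_weights)
    return boot_omega_init, boot_lambda_init
```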

Baseline drift is tiny (rel ≈ 1e-4, not literally ULP-level):
TestScaleEquivariance._BASELINE["bootstrap"] SE shifts from
0.21424970247101688 (cold-start) to 0.21427381053829253 (warm-start), as
a handful of the 200 bootstrap draws converge to a different
sparsification pattern than the cold-start pass. rel=1e-14 bit-identity
on the warm-start capture holds.

Coverage MC: NOT re-run. User course-correction ("not sure we need to
run the full MC with every pass here"). Math argument: strictly-convex
FW → unique global minimum; aggregated rejection rates shift well under
2σ MC noise (≈0.02–0.05 per cell at B=200 × 500 seeds). The committed
benchmarks/data/sdid_coverage.json (generated under cold-start in commit
15481d9) stays accurate for the warm-start path within MC noise.
REGISTRY.md coverage-MC Note now documents the artifact regeneration
cadence explicitly: regenerate on methodology changes that materially
shift per-draw numerics, not on refactors that preserve the FW global
optimum (warm-start, backend migration, renames, docstring fixes).
Per-seed bit-identity on the captured fixture is the cheaper, stricter
parity check.

Other changes in this commit:
- diff_diff/synthetic_did.py replicate-weight guard message rewritten
  (R2 P3 maintainability): the old message told users to switch to a
  TSL-based survey design with strata/psu/fpc, which is no longer
  supported by any SDID variance method in this release.
- docs/tutorials/18_geo_experiments.ipynb re-executed; bootstrap SE
  displayed in the comparison cell is now 4.44 (warm-start) vs 4.50
  (cold-start). Narrative unchanged.
- CHANGELOG.md and METHODOLOGY_REVIEW.md describe the warm-start shape
  and drop the "cross-language bit parity" implication from earlier
  wording.
- REGISTRY.md bootstrap section describes the warm-start mechanism
  including the FW strict-convexity argument for why warm and cold
  converge to the same answer but warm matches R's procedure at the
  algorithm level.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 105cf73a28c6e8aae698945be5cea1020fb79e5b


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The refit SDID bootstrap implementation itself looks methodologically aligned with the updated registry: it re-estimates both ω and λ per draw, keeps the sqrt((r-1)/r) SE aggregation, and still routes bootstrap/jackknife inference through safe_inference().
  • The prior user-facing survey-message problem appears resolved: the current guard text no longer points users toward unsupported full-design SDID survey paths.
  • P1 [Newly identified]: the survey-bootstrap removal was not propagated to the legacy SDID survey tests. The new guards in diff_diff/synthetic_did.py:L311-L337 now reject exactly the combinations that tests/test_survey_phase5.py:L179-L196, tests/test_survey_phase5.py:L226-L253, and tests/test_survey_phase5.py:L309-L322 still assert should succeed.
  • P2: several changed documentation surfaces still describe the old SDID bootstrap contract (fixed weights / Rao-Wu), so the project’s own methodology cross-references are internally inconsistent.
  • The deferred survey+refit composition work and the cross-language parity anchor are both properly tracked in TODO.md:L103-L104, so they are informational rather than blocking.
  • I could not run pytest here because this environment is missing runtime test dependencies (pytest, numpy).

Methodology

Code Quality

  • No findings.

Performance

  • No findings. The slowdown is the expected cost of replacing the old fixed-weight shortcut with per-draw Frank-Wolfe refitting, and that tradeoff is documented.

Maintainability

  • Severity P3. _bootstrap_se()’s docstring says unit_weights / time_weights “are not used inside the loop,” but the implementation immediately uses both as warm-start initializations. Impact: this gives maintainers the wrong mental model of what is fixed versus merely reused as initialization. Concrete fix: rewrite diff_diff/synthetic_did.py:L858-L862 to say the original weights are not reused as fixed estimator weights, only as Frank-Wolfe warm starts, matching diff_diff/synthetic_did.py:L873-L877 and diff_diff/synthetic_did.py:L925-L962.

Tech Debt

  • Severity P3 informational. The two real follow-ups introduced here are explicitly tracked: survey+refit composition and cross-language refit parity in TODO.md:L103-L104. Impact: none for approval. Concrete fix: none in this PR; keep those TODOs open.

Security

  • No findings.

Documentation/Tests

  • Severity P1 [Newly identified]. The new survey guards in diff_diff/synthetic_did.py:L311-L337 now make bootstrap + pweight-only survey and bootstrap + full survey design raise NotImplementedError, but the legacy SDID survey tests still assert the old success contract in tests/test_survey_phase5.py:L179-L196, tests/test_survey_phase5.py:L226-L253, and tests/test_survey_phase5.py:L309-L322. Impact: the branch leaves the SDID survey contract internally contradictory and will keep the test suite red once dependencies are present. Concrete fix: rewrite those tests to assert the new NotImplementedError behavior, and replace the old Rao-Wu comparison with positive pweight-only placebo / jackknife cases.
  • Severity P2. Several changed documentation surfaces still state the old SDID bootstrap story: docs/methodology/REGISTRY.md:L2699-L2699 still says Unit-level bootstrap (fixed weights), docs/methodology/REGISTRY.md:L2889-L2895 still lists SyntheticDiD under Rao-Wu survey bootstrap, docs/survey-roadmap.md:L52-L54 still lists SyntheticDiD among the Rao-Wu survey-bootstrap estimators, and diff_diff/guides/llms-full.txt:L1674-L1676 still says survey-aware bootstrap includes SyntheticDiD via Rao-Wu. Impact: reviewers and downstream doc consumers get conflicting methodology guidance from the project’s own source material. Concrete fix: update those cross-reference summaries to say SDID bootstrap is the paper-faithful refit path for non-survey fits only, and that SDID has no survey-bootstrap support in this release.

Path to Approval

  1. Update the stale SDID survey tests in tests/test_survey_phase5.py so they match the new public contract: variance_method="bootstrap" with any survey_design should raise NotImplementedError, while pweight-only positive coverage should move to placebo / jackknife.

Addresses CI review R3 findings on PR #351:

P1: rewrite three legacy SDID survey tests in tests/test_survey_phase5.py
that still asserted the old fixed-weight + Rao-Wu success contract so they
match the new upstream guards (bootstrap + any survey design and full
strata/PSU/FPC with any variance method now raise NotImplementedError).
test_full_design_bootstrap_smoke -> test_full_design_bootstrap_raises;
test_full_design_se_differs_from_weights_only is replaced by a positive
placebo pweight-only smoke test since the old Rao-Wu vs pweight-only SE
comparison no longer has two valid paths to compare;
test_bootstrap_with_survey -> test_bootstrap_with_pweight_only_raises
plus a new test_jackknife_with_pweight_only for positive coverage. Also
align the pre-existing placebo/jackknife guard tests with the updated
guard wording (same stale-regex class of issue the P1 flagged).

P2: remove SyntheticDiD from cross-reference Rao-Wu Rescaled Bootstrap
sections that survived the bootstrap rewrite. REGISTRY comparison table
row now describes the alternative as paper-faithful refit bootstrap (not
fixed weights); the Rao-Wu section lists only SunAbraham and TROP with an
explicit Note explaining why SDID is excluded (refit + Rao-Wu composition
requires weighted-FW derivation, tracked in TODO.md and sketched in
REGISTRY SDID survey-regression Note). survey-roadmap.md Phase 6 bullet
and llms-full.txt survey-aware bootstrap bullet updated with the same
framing.

P3: rewrite _bootstrap_se docstring (diff_diff/synthetic_did.py) to say
the fit-time unit_weights and time_weights are Frank-Wolfe warm-start
initializations on each draw, not unused parameters retained for
signature parity. The in-loop narrative comment at the warm-start site
was already correct; the top-of-function docstring was stale from the
pre-warm-start commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: f49a19f8cdb71d7cac7da51f37c5738c58b21543


Overall Assessment

✅ Looks good

No unmitigated P0 or P1 findings remain in the changed PR surface. The core SDID refit-bootstrap path is aligned with the updated methodology registry, and the prior re-review blocker around stale SDID survey tests appears resolved.

Executive Summary

  • The changed SyntheticDiD bootstrap path is methodologically aligned with the updated registry: it re-estimates ω̂_b and λ̂_b per draw, preserves sqrt((r-1)/r) SE aggregation, and still routes bootstrap/jackknife inference through safe_inference() for all-or-nothing NaN gating. See diff_diff/synthetic_did.py:L590-L657, diff_diff/synthetic_did.py:L830-L1057, diff_diff/utils.py:L174-L206, docs/methodology/REGISTRY.md:L1497-L1551.
  • The previous P1 around SDID survey tests is resolved. The changed tests now assert the new NotImplementedError contract for bootstrap and full-design survey paths. See tests/test_survey_phase5.py:L179-L216.
  • The cross-surface result/reporting surfaces look consistent in the current tree: bootstrap replications still render in summary(), and BusinessReport still labels SDID alpha overrides as bootstrap variance instead of falling through to analytical wording. See diff_diff/results.py:L963-L966, diff_diff/business_report.py:L593-L604, tests/test_business_report.py:L4658-L4702.
  • Remaining issues are minor P3 cleanup only: one deferred-work breadcrumb still contains placeholder commit IDs, one new survey test no longer checks the ATT-equivalence contract stated in its name/docstring, and one new coverage-artifact docstring still says "4 methods" although the artifact/test now use 3.
  • I could not run pytest here because this environment is missing numpy; verification was limited to static inspection plus compile() parsing of the changed Python files.

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3. Impact: the deferred survey-bootstrap follow-up is properly tracked, but the newly added breadcrumbs are not directly actionable because they still contain placeholder SHAs (<sdid-bootstrap-refit-removal-sha> and <removal-commit>). Concrete fix: replace those placeholders with the actual commit hash, or rewrite the note to point to a stable PR number / git log --grep lookup instead. See TODO.md:L103-L104, docs/methodology/REGISTRY.md:L1550-L1550.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: test_placebo_full_vs_pweight_only_att_match is named and documented as an ATT-equivalence regression check, but its body now only asserts that the pweight-only fit is finite. That means the claimed "stripping strata/PSU should not change ATT" contract is no longer actually tested. Concrete fix: either compare the pweight-only ATT against the same data with strata/PSU/FPC columns ignored/dropped, or rename the test/docstring to describe it as a smoke test. See tests/test_survey_phase5.py:L227-L252.
  • Severity P3. Impact: the new coverage-artifact test docstring still says the MC study covers "4 methods," while the artifact schema and assertions immediately below now enforce exactly 3 methods (placebo, bootstrap, jackknife). That leaves stale methodology wording in the new regression surface. Concrete fix: update the docstring to say "3 methods." See tests/test_methodology_sdid.py:L2865-L2907.

- Replace placeholder commit SHAs in TODO.md and REGISTRY.md deferred-
  work breadcrumbs with the actual fixed-weight-removal commit hash
  (91082e5 "Replace SDID fixed-weight bootstrap with paper-faithful
  refit"). The follow-up implementer can now directly run
  `git show 91082e5:diff_diff/synthetic_did.py` to recover the Rao-Wu
  scaffolding (rw_control/rw_treated split, degenerate-retry check,
  treated-mean weighting). Also cite PR #351 for stability against
  commit-hash churn.

- Rename tests/test_survey_phase5.py::
  test_placebo_full_vs_pweight_only_att_match to
  test_placebo_with_pweight_only_full_design_stripped_att_match and
  promote the body from a single-fit smoke to the ATT-equivalence check
  the docstring claims: fit with two equivalent
  for-bit (abs=1e-12). The intent is to confirm strata/psu/fpc columns
  sitting on the DataFrame are not tacitly read unless the SurveyDesign
  references them.

- Fix stale "3 DGPs × 4 methods" docstring in
  tests/test_methodology_sdid.py::TestCoverageMCArtifact to read
  "3 DGPs × 3 methods", matching the 3-method schema the test now
  enforces (placebo / bootstrap / jackknife; bootstrap_refit and the
  old fixed-weight 'bootstrap' enum are both gone).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: ec92d39b9fbf6d16f5e7642bf534b278a6d35a83


Overall Assessment
✅ Looks good

No unmitigated P0 or P1 findings in the changed PR surface.

Executive Summary

Methodology

  • No unmitigated findings. The changed estimator and variance code is internally consistent with the methodology registry on Algorithm 2 refit bootstrap, analytical bootstrap/jackknife p-values, and NaN-safe inference. diff_diff/synthetic_did.py#L644 diff_diff/synthetic_did.py#L842 docs/methodology/REGISTRY.md#L1497
  • Severity P3. Impact: Support for SDID survey bootstrap and full-design surveys regresses in this release, but the deviation is explicitly documented and tracked, so it is informational only under the review policy. Concrete fix: no approval blocker; the weighted-Frank-Wolfe survey composition follow-up is already specified in TODO.md#L103.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3. Impact: external parity for the new refit-bootstrap semantics is still deferred; confidence currently rests on same-library regression tests plus the committed MC artifact rather than a live R/Julia anchor. Concrete fix: implement the cross-language parity harness already tracked in TODO.md#L104.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: test_placebo_with_pweight_only_full_design_stripped_att_match compares two semantically identical pweight-only SurveyDesigns, so it cannot catch the regression its docstring claims to cover. A future silent pickup of stratum or psu columns would still pass this test. Concrete fix: compare the pweight-only fit against the same dataset after dropping or renaming stratum and psu, or against a copy where those columns are absent. tests/test_survey_phase5.py#L227

Verification note: runtime execution was not possible here because this environment is missing numpy and pytest; review was limited to static inspection and AST parsing.

The previous rewrite compared two semantically identical pweight-only
SurveyDesign constructions, which can't catch the contract the docstring
claimed — if a future change silently picked up `stratum` or `psu` by
name, both fits would pick them up identically and the test would still
pass.

Rewrite to compare a fit on the original DataFrame (with `stratum` /
`psu` columns present) against a fit on the same data with those columns
physically dropped. If the estimator ever silently reads those columns
by naming convention the two fits would diverge and the abs=1e-12 ATT
check would fail.
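
The drop-and-compare shape can be sketched generically (`fit_att` stands in for the real SDID fit; the helper name is hypothetical):

```python
import pandas as pd

def att_with_columns_stripped(df: pd.DataFrame, fit_att):
    """Run the same fit on the original frame and on a copy with the
    survey design columns physically dropped; a silent read of
    `stratum`/`psu` by naming convention would make the ATTs diverge."""
    stripped = df.drop(columns=["stratum", "psu"], errors="ignore")
    return fit_att(df), fit_att(stripped)

# usage in a test:
#   att_full, att_stripped = att_with_columns_stripped(df, fit_att)
#   assert abs(att_full - att_stripped) <= 1e-12
```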

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 826b1a8e722960a0936da30785fbdf5f4ed8cf97


Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings in the changed PR surface. The remaining issues are P2/P3 only. Runtime verification was static-only because this environment does not have numpy, pandas, or pytest.

Executive Summary

Methodology

  • Severity P3. Impact: No unmitigated methodology defect. The affected method is SyntheticDiD bootstrap variance. The refit path in diff_diff/synthetic_did.py:L830-L1057 refits both ω_b and λ_b, keeps the sqrt((r-1)/r) aggregation, and still routes bootstrap/jackknife inference through safe_inference() in diff_diff/synthetic_did.py:L643-L657, matching the local registry at docs/methodology/REGISTRY.md:L1497-L1551. Upstream synthdid also documents bootstrap/jackknife/placebo as Algorithms 2/3/4, and its bootstrap source calls back into synthdid_estimate() while synthdid_estimate() defaults update.omega / update.lambda to refitting-from-initialization when weights are supplied. Concrete fix: none. (synth-inference.github.io)
  • Severity P3. Impact: The SDID survey-bootstrap/full-design regression is explicit and mitigated by documentation/tracking: any survey + variance_method="bootstrap" fit and any full-design SDID fit now raises NotImplementedError, and that deviation is both documented and tracked. Concrete fix: none for approval; the weighted-Frank-Wolfe survey-bootstrap follow-up is already queued. diff_diff/synthetic_did.py:L283-L336 docs/methodology/REGISTRY.md:L1549-L1550 TODO.md:L103-L104

Code Quality

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3. Impact: External parity anchoring for the refit bootstrap remains deferred. The old fixed-weight R fixture is deleted, and the replacement Julia/R parity anchor is only tracked in TODO.md, so this is informational rather than blocking. Concrete fix: land the queued cross-language parity harness already listed in TODO.md:L104-L104.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: [Newly identified] test_bootstrap_p_value_null_calibration still documents and asserts the removed fixed-weight bootstrap behavior. Its docstring says the test is for the “fixed-weight regime” and its lower-bound assertion rejection_rate > 0.05 assumes systematic over-rejection, but the new registry and committed coverage artifact now describe near-nominal refit-bootstrap calibration. That leaves the test description wrong today and can make the assertion flaky or fail on correct code. Concrete fix: rewrite the test around its actual regression target, namely “not collapsed near 0 and not catastrophically large,” and update the docstring/comments to the refit-bootstrap semantics. tests/test_methodology_sdid.py:L2547-L2589 docs/methodology/REGISTRY.md:L1552-L1570
  • No other findings. The prior informational survey-test issue appears addressed in tests/test_survey_phase5.py:L227-L270.

Prior behavior: ``_bootstrap_se`` tallied Frank-Wolfe non-convergence via
``warnings.catch_warnings``, but the Rust FW entry point is silent on
``max_iter`` exhaustion (only the pure-NumPy path called
``warn_if_not_converged``). On the default Rust backend the aggregate
warning at the end of the bootstrap loop therefore never fired, even
when draws did not converge — a silent failure.

Fix: thread an explicit convergence bool out of the Rust solver.

Rust (``rust/src/weights.rs``, ``rust/src/lib.rs``)
- ``sc_weight_fw_gram`` / ``sc_weight_fw_standard`` now set and return
  ``converged = true`` on a min-decrease break, ``false`` otherwise.
- ``sc_weight_fw_internal`` returns ``(Array1<f64>, bool)``.
- Existing ``sc_weight_fw`` pyfunction destructures and drops the bool,
  preserving its ABI for the rank-selection heuristic in ``prep.py``
  and for any third-party consumer.
- New pyfunction ``sc_weight_fw_with_convergence`` returns the
  ``(array, bool)`` tuple, wrapping the same internal solver.
- Internal helpers ``compute_time_weights_internal`` /
  ``compute_sdid_unit_weights_internal`` destructure the inner calls
  and still return ``Array1<f64>`` (their pyfunctions discard
  convergence — Python callers that need it use the Python two-pass
  dispatcher).

Python (``diff_diff/utils.py``, ``diff_diff/_backend.py``)
- Import the new Rust entry point as
  ``_rust_sc_weight_fw_with_convergence``.
- ``_sc_weight_fw`` / ``_sc_weight_fw_numpy`` gain a
  ``return_convergence=False`` kwarg. Default path is unchanged; with
  the flag set, return ``(weights, converged)``.
- ``compute_sdid_unit_weights`` / ``compute_time_weights`` gain the
  same kwarg and propagate the AND of the two FW passes (pre-sparsify
  + main). When the flag is set, the Rust top-level fast-path is
  skipped (it is silent on non-convergence) in favor of the Python
  two-pass dispatcher; inner FW calls still dispatch to the Rust
  solver via ``sc_weight_fw_with_convergence``.

Consumers (``diff_diff/synthetic_did.py::_bootstrap_se``)
- Replace the ``warnings.catch_warnings`` dance with an explicit
  ``return_convergence=True`` call on both helpers.
  ``fw_nonconvergence_count`` now increments based on the AND of
  ``omega_converged`` and ``lambda_converged`` per draw; works under
  both Rust and pure-Python backends.

Regression test (``tests/test_methodology_sdid.py::TestBootstrapSE::
test_bootstrap_fw_nonconvergence_warning_fires_under_rust``)
- Monkeypatches ``_rust_sc_weight_fw_with_convergence`` to always
  report ``converged=False`` and asserts the aggregate FW non-
  convergence ``UserWarning`` fires on an otherwise-nominal SDID
  bootstrap fit. Skipped when the Rust backend is unavailable.
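
A sketch of the consumer-side aggregation (names and the warning threshold follow the description above; this is not the library code, and the 5% cutoff is an assumption):

```python
import warnings
import numpy as np

def draw_converged(omega_converged: bool, lambda_converged: bool) -> bool:
    """Per-draw convergence is the AND of the omega and lambda solves."""
    return omega_converged and lambda_converged

def warn_on_fw_nonconvergence(per_draw_flags, threshold=0.05):
    """Emit one aggregate UserWarning when the share of non-converged
    bootstrap draws exceeds the threshold; returns the observed share."""
    flags = np.asarray(per_draw_flags, dtype=bool)
    rate = 1.0 - flags.mean()
    if rate > threshold:
        warnings.warn(
            f"Frank-Wolfe did not converge on {rate:.1%} of bootstrap draws",
            UserWarning,
        )
    return rate
```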

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: c0d089bf9d6344c079fbba7f7a00717888d1f463


Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings in the changed PR surface. On this re-review, the earlier Rust non-convergence warning issue and the BR inference-label concern both look addressed. The remaining issues are P2/P3 only.

Executive Summary

  • The affected method is SyntheticDiD bootstrap variance. The implementation now refits both ω and λ per bootstrap draw, keeps the established sqrt((r-1)/r) × sd(...) SE aggregation, and still routes analytical inference through safe_inference(). I did not find an unmitigated methodology defect.
  • The previous Rust-warning concern appears resolved by the new convergence-return plumbing in diff_diff/utils.py:1350, diff_diff/synthetic_did.py:958, and rust/src/weights.rs:518.
  • The previous BR allow-list concern also appears resolved; unchanged diff_diff/business_report.py:602 already recognizes SDID variance_method values, and the new regression test at tests/test_business_report.py:4689 covers the alpha-override path.
  • One previous P2 remains unresolved: tests/test_methodology_sdid.py:2604 still documents and bounds bootstrap null calibration as if the deleted fixed-weight bootstrap were still the live behavior.
  • Static-only review here; I could not run pytest because the command is unavailable in this environment.

Methodology

Affected method: SyntheticDiD bootstrap variance. Cross-checking docs/methodology/REGISTRY.md:1497, diff_diff/synthetic_did.py:948, diff_diff/synthetic_did.py:1053, and diff_diff/synthetic_did.py:644 against the cited SDID material and the current synthdid bootstrap flow did not reveal an unmitigated methodology defect.

  • Severity P3. Impact: The survey-bootstrap capability regression is explicitly documented and tracked in docs/methodology/REGISTRY.md:1549 and TODO.md:107, so under this rubric it is informational rather than blocking. Concrete fix: none for approval.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3. Impact: The deleted cross-language parity fixture leaves an external parity anchor as follow-up work, but that follow-up is explicitly tracked in TODO.md:108, so it is not a blocker. Concrete fix: none for approval.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: [Previous finding unresolved] tests/test_methodology_sdid.py:2604 still describes bootstrap as the removed fixed-weight regime and hard-codes rejection_rate > 0.05 at α=0.05. The updated calibration table in docs/methodology/REGISTRY.md:1557 and docs/methodology/REGISTRY.md:1566 now says refit bootstrap is near nominal, so this test can fail on correct behavior or bias future changes toward anti-conservative output. Concrete fix: rewrite the docstring/comments to the refit-bootstrap contract and replace the lower bound with a calibration-agnostic regression guard, e.g. “not collapsed near 0 and not catastrophically high,” or a dispersion/quantile check that specifically catches the old dispatch bug.
  • Severity P3. Impact: [Newly identified] The Methodology Registry still says the Rust backend is silent on Frank-Wolfe non-convergence in docs/methodology/REGISTRY.md:1528, but this PR now threads explicit convergence status through Rust in rust/src/weights.rs:518, diff_diff/utils.py:1350, and emits the aggregate warning in diff_diff/synthetic_did.py:1041. Since REGISTRY.md is the methodology source of truth, that note is now stale. Concrete fix: update the edge-case bullet to reflect the new Rust convergence flag and the aggregated bootstrap warning behavior.

…W note

P2 (Documentation/Tests) — ``test_bootstrap_p_value_null_calibration``
at ``tests/test_methodology_sdid.py:2604`` was still framed around the
deleted fixed-weight regime and encoded an anti-conservative lower
bound ``rejection_rate > 0.05`` that biased the test toward keeping
over-rejection. Rename to ``test_bootstrap_p_value_null_dispersion``
and replace the directional calibration assertion with a
calibration-agnostic dispersion guard: ``np.std(p_values) > 0.10``
catches the pre-fix p-clustering dispatch bug (p≈0.5 on every seed
under the old bug), and a loose band ``0.01 <= rejection_rate <=
0.40`` catches both SE-collapse and SE-explosion. Matches this
session's commit c0d089b by removing the remaining lingering
fixed-weight framing in the test suite. Tested locally: 100-seed
refit bootstrap passes with comfortable margin.
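
The two calibration-agnostic guards can be sketched as (illustrative function, thresholds copied from the description above):

```python
import numpy as np

def null_dispersion_guards(p_values, alpha=0.05):
    """Dispersion guard catches the old p~0.5-clustering dispatch bug;
    the loose rejection band catches SE collapse (rate near 0) and SE
    explosion (rate far above nominal) without asserting a directional
    calibration. Returns the observed rejection rate."""
    p = np.asarray(p_values, dtype=float)
    rejection_rate = float((p < alpha).mean())
    assert np.std(p) > 0.10, "p-values clustered: dispatch-bug regression"
    assert 0.01 <= rejection_rate <= 0.40, "SE collapse or explosion"
    return rejection_rate
```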

P3 (newly identified) — REGISTRY.md §Frank-Wolfe non-convergence edge
case (line 1528) still said the Rust backend silently returns the
final iterate and cited a follow-up. This session's commit c0d089b
already landed the Rust-side convergence flag, Python
``return_convergence=True`` kwargs on the weight helpers, and the
aggregate ``UserWarning`` in ``_bootstrap_se``. Update the bullet to
describe the new behavior: bootstrap path now uses
``sc_weight_fw_with_convergence`` (Rust) / numpy
``return_convergence=True`` and aggregates a single summary warning
above the 5% threshold; standalone callers that do not opt into
convergence tracking retain the legacy silent-Rust / warning-numpy
asymmetry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 9280cddab051623d6a9b5d30c9eb006cecf81397


Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings in the changed PR surface. The SDID refit-bootstrap implementation itself looks methodologically consistent with the cited SDID paper/official synthdid variance surface, and the previous re-review items around the null-calibration test and the Rust non-convergence note appear addressed. Static review only; I did not run pytest or the coverage Monte Carlo harness here.

Executive Summary

  • The affected method is SyntheticDiD variance estimation, specifically the refit variance_method="bootstrap" path; I did not find a silent correctness bug in the changed estimator/SE/inference flow.
  • The prior test-related concern looks resolved: the slow bootstrap null-calibration guard is now calibration-agnostic instead of baking fixed-weight over-rejection into the expected behavior.
  • The survey-bootstrap capability regression is explicitly documented in REGISTRY.md and tracked in TODO.md, so it is informational rather than blocking under this rubric.
  • Remaining findings are P3-only documentation/UX drift: incorrect R-default wording, one source-paper overstatement in the new coverage narrative, one stale registry test reference, and one outdated survey warning string.

Methodology

Affected method: SDID variance estimation. The changed bootstrap path in diff_diff/synthetic_did.py:L590-L657, diff_diff/synthetic_did.py:L830-L1059, diff_diff/utils.py:L1301-L1726, and rust/src/weights.rs:L125-L558 re-estimates both ω and λ on each bootstrap draw, keeps the established sqrt((r-1)/r) * sd(...) SE aggregation, and still routes bootstrap/jackknife inference through safe_inference(). That is consistent with the SDID source reference and the official synthdid variance implementation, which documents bootstrap/jackknife/placebo as Algorithms 2/3/4 and implements bootstrap by resampling all units, renormalizing omega, and re-entering synthdid_estimate with stored opts. (aeaweb.org)

  • Severity P3. Impact: diff_diff/synthetic_did.py:L50-L61 and METHODOLOGY_REVIEW.md:L517-L519 now make the variance defaults internally inconsistent: the placebo bullet/review note still say placebo is “R’s default,” while the same PR also describes bootstrap as matching “R’s default,” and upstream synthdid docs/source show vcov() defaults to bootstrap. This does not change runtime behavior, but it misstates the reference implementation and obscures that the library’s placebo default is a divergence from R rather than a match. Concrete fix: change those lines to say placebo is the library default, or add a REGISTRY.md Note (deviation from R) if that divergence is intentional. (synth-inference.github.io)
  • Severity P3. Impact: docs/methodology/REGISTRY.md:L1566-L1570 says the PR’s smaller-panel jackknife over-rejection is “in line” with Arkhangelsky et al. §6.3’s “98% / 93% coverage pattern.” The cited paper preview presents mixed evidence instead: the iid design is slightly conservative at 98% coverage, while the dependent-error design is 93% coverage. The current sentence overstates what the cited source supports. Concrete fix: rephrase this as mixed jackknife calibration in the paper, rather than uniformly anti-conservative support. (aeaweb.org)

Code Quality

No findings.

Performance

No findings.

Maintainability

  • Severity P3. Impact: diff_diff/synthetic_did.py:L1128-L1133 and diff_diff/synthetic_did.py:L1220-L1224 still tell users to “Consider using variance_method='bootstrap'” from placebo failure paths. After this PR, bootstrap rejects every survey design, so that guidance is now wrong on the newly restricted survey-weighted path. Concrete fix: branch those warning strings on survey presence and suggest jackknife / more controls for survey users.

Tech Debt

Security

No findings.

Documentation/Tests

  • Severity P3. Impact: docs/methodology/REGISTRY.md:L1570 still points readers to TestPValueSemantics::test_bootstrap_p_value_null_calibration, but the PR renamed the actual regression guard to tests/test_methodology_sdid.py:L2604-L2657. The registry now references a non-existent test when describing the calibration safeguard. Concrete fix: update the registry reference to test_bootstrap_p_value_null_dispersion.

Five P3-only items from R8 CI review:

1. Correctly attribute R's default `vcov()` method:
   - diff_diff/synthetic_did.py:53 docstring previously claimed placebo was
     "R's default". R's `synthdid::vcov()` actually defaults to
     `method="bootstrap"`. Reword to describe placebo as the library default
     with a rationale paragraph (survey availability, perf) and cross-
     reference to the REGISTRY Note below.
   - METHODOLOGY_REVIEW.md item 5 said the same incorrect thing. Rewrite
     to frame the default as a deliberate library deviation with the same
     two-reason rationale.

2. Add a REGISTRY.md Note (default variance_method deviation from R)
   that documents the rationale explicitly (survey availability + perf)
   so the AI reviewer recognizes the deviation as documented rather than
   as an accidental contradiction between the docstring and R's actual
   default.

3. Soften the placebo-failure fallback guidance in
   `_placebo_variance_se`: the previous strings recommended
   `variance_method="bootstrap"`, which now raises `NotImplementedError`
   on every survey design. Branch on `w_control is not None` (survey
   fit) to recommend jackknife + adding controls for survey users and
   keep the bootstrap + jackknife + more-controls recommendation for
   non-survey users.

4. Fix the REGISTRY pointer to the slow dispersion guard: reference
   `TestPValueSemantics::test_bootstrap_p_value_null_dispersion` (the
   current name) instead of the pre-rename
   `test_bootstrap_p_value_null_calibration`, and describe the new
   contract (calibration-agnostic dispersion + loose rejection-rate
   band).

5. Rephrase the REGISTRY coverage-MC narrative about jackknife
   anti-conservatism: the paper's AER §6.3 shows mixed jackknife
   evidence (98% iid — slightly conservative; 93% AR(1) — slightly
   anti-conservative). Our observed anti-conservatism aligns with the
   AR(1) branch of the paper's evidence, not a uniform "in line"
   pattern.
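The survey-aware branching in item 3 can be sketched as follows — a hedged illustration of the `w_control is not None` split, with hypothetical message text rather than the library's actual strings:

```python
import warnings

def placebo_failure_guidance(w_control=None):
    # Branch the fallback recommendation on survey presence: bootstrap now
    # raises NotImplementedError on every survey design, so it must not be
    # suggested when w_control marks a survey-weighted fit.
    if w_control is not None:
        msg = ("Placebo variance estimation failed. Consider "
               "variance_method='jackknife' or adding more control units.")
    else:
        msg = ("Placebo variance estimation failed. Consider "
               "variance_method='bootstrap', variance_method='jackknife', "
               "or adding more control units.")
    warnings.warn(msg, UserWarning)
    return msg
```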

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 710f96654d443e7a6546febe650e0e7cc97e7578


Overall Assessment

✅ Looks good

Executive Summary

  • No unmitigated P0/P1 findings in the changed SDID estimator / SE / inference path.
  • The refit bootstrap implementation now matches the documented methodology surface and the official synthdid variance contract, rather than the old fixed-weight shortcut.
  • The prior re-review P3s appear addressed: default-vs-R wording, the jackknife coverage narrative, the stale registry test reference, and the survey-aware placebo fallback messaging are all updated in the changed files.
  • The remaining items are P3-only: one user-facing runtime-guidance inconsistency, plus tracked tech-debt items around survey-bootstrap composition and restoring an external parity anchor.
  • This was a static review only; I could not execute tests in this workspace because numpy and pytest are not installed.

Methodology

No unmitigated findings. The refit bootstrap path in diff_diff/synthetic_did.py:L834-L1063, together with the warm-start / convergence plumbing in diff_diff/utils.py:L1301-L1715 and rust/src/weights.rs:L121-L555, is consistent with the methodology registry in docs/methodology/REGISTRY.md:L1497-L1552 and with the official synthdid surface: vcov() documents bootstrap/jackknife/placebo as Algorithms 2/3/4, defaults to bootstrap, and the R bootstrap code renormalizes omega then re-enters synthdid_estimate() with stored opts, while synthdid_estimate() treats update.omega / update.lambda as re-estimation from the passed weights as initializations. (synth-inference.github.io)

Code Quality

No findings.

Performance

No code-level findings.

Maintainability

No findings.

Tech Debt

  • Severity P3 (tracked). Impact: diff_diff/synthetic_did.py:L315-L341, docs/methodology/REGISTRY.md:L1549-L1551, and TODO.md:L107-L108 explicitly document that paper-faithful bootstrap currently rejects all survey designs; full-design survey users therefore have no SDID variance path in this release, and pweight-only users must use placebo/jackknife. Concrete fix: none for approval; the weighted-Frank-Wolfe / Rao-Wu composition follow-up is already tracked.
  • Severity P3 (tracked). Impact: the new external parity anchor is deferred in TODO.md:L108-L108, and the registry now explicitly avoids claiming bit-level cross-language bootstrap parity in docs/methodology/REGISTRY.md:L1506-L1508. Concrete fix: none for approval; land the tracked R/Julia parity harness later.

Security

No findings.

Documentation/Tests

Single actionable P3 from R9 CI review: user-facing runtime wording for
refit bootstrap had diverged across surfaces, giving conflicting
expectations about the cost of the new bootstrap path:

- CHANGELOG.md and diff_diff/synthetic_did.py said ~5-30x slower.
- diff_diff/power.py said ~10-100x slower (two sites).
- docs/choosing_estimator.rst said ~10-100x slower.
- docs/performance-scenarios.md said ~10-100x slower.
- docs/methodology/REGISTRY.md coverage-MC block said ~10-100x slower.
- docs/tutorials/03_synthetic_did.ipynb and
  docs/tutorials/18_geo_experiments.ipynb said ~10-100x slower.
- benchmarks/python/coverage_sdid.py said the 500-seed MC run takes
  ~2-4 hours, while REGISTRY.md said ~15-40 min (the actually-observed
  wall-clock; aer63 is ~37 min, balanced + unbalanced ~2 min combined).

Unify on "~5-30x slower than placebo (panel-size dependent)" for the
per-fit slowdown (the warm-start plumbing closed the gap vs the pre-
warm-start cold-start estimate of 10-100x) and on "~15-40 min" for the
coverage MC wall-clock. The CHANGELOG entry already notes the 10-100x
figure as a historical "prior estimate" — left as-is so the release
notes continue to explain the revision.

Also fix two tutorial surfaces that still called placebo "R's default"
(tutorial 03, sections 7 and 10). R's default is bootstrap; placebo is
the library default per the REGISTRY Note added in 710f966. Reword to
describe placebo as the library default with the rationale pointer.

Verified: 353 tests pass across test_methodology_sdid, test_power,
test_guides (UTF-8 fingerprint preserved). Tutorial-18 nbmake drift
guards unaffected because the change is markdown-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: dc2045fb781e7d16c6c3f768d9f6a4535d51d925


Overall Assessment

✅ Looks good

Static review only; I could not execute tests in this workspace because numpy, pandas, and pytest are not installed.

Executive Summary

  • No unmitigated P0/P1 findings in the changed SyntheticDiD estimator / SE / inference path.
  • The prior P3 on inconsistent slowdown/runtime guidance appears addressed across the changed surfaces.
  • The survey-bootstrap capability regression and the loss of the old cross-language parity fixture are both explicitly documented/tracked, so they remain P3-only.
  • [Newly identified] Two unchanged survey docs still advertise the removed SDID Rao-Wu bootstrap path; this is a P3 documentation cleanup.

Methodology

No unmitigated findings. The refit bootstrap path in diff_diff/synthetic_did.py:594, diff_diff/synthetic_did.py:846, and docs/methodology/REGISTRY.md:1497 is consistent with the source-material contract: official synthdid docs map bootstrap / jackknife / placebo to Algorithms 2 / 3 / 4 and default vcov() to bootstrap; bootstrap_sample() passes renormalized omega plus stored opts back into synthdid_estimate(); and synthdid_estimate() treats supplied weights as initializations while update.omega / update.lambda remain enabled. (synth-inference.github.io)

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity P3 (tracked). Impact: the survey-support regression is explicit and documented: diff_diff/synthetic_did.py:315 and diff_diff/synthetic_did.py:334 now reject full designs and any bootstrap-plus-survey combination, while docs/methodology/REGISTRY.md:1549 and TODO.md:107 track the weighted-Frank-Wolfe / Rao-Wu follow-up. Concrete fix: none for approval; land the tracked weighted-FW survey-bootstrap work later.
  • Severity P3 (tracked). Impact: the old fixed-weight R fixture was removed, so the refit bootstrap no longer has an external parity anchor in-tree; the follow-up is explicitly tracked in TODO.md:108. Concrete fix: none for approval; add the planned Julia/R parity harness later.

Security

No findings.

Documentation/Tests

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 22, 2026
The Pure Python Fallback CI job failed this test at rel-diff 0.5310 >
0.40 tolerance. Root cause is test-infrastructure, not a correctness
regression.

ci_params.bootstrap(min_n=...) silently caps min_n at 49 in pure-Python
mode to keep CI fast (see tests/conftest.py:210); the test's 0.40
tolerance was explicitly calibrated for B∈[100, 200] per its docstring
comment. At B=49 the bootstrap SE is not yet converged to the placebo
SE (rel-diff 0.5310 at B=49; 0.3856 at B=100; 0.2708 at B=200 on the
same seed), so the failure is MC-noise, not a regression.

The 15 Rust-backed matrix jobs (macOS/Linux x86/Linux ARM/Windows × 3
Python versions) all run the test at the full B=200 and pass with
comfortable margin — the regression guard is still exercised on the
default user install path.

Skip under pure-Python mode with an explicit rationale citing the
min_n cap and the Rust-backed coverage that preserves the contract.
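One way the skip could look, assuming a module-level backend flag like the HAS_RUST_BACKEND name used elsewhere in this PR (the test name below is a placeholder for the guarded test):

```python
import pytest

try:
    from diff_diff._backend import HAS_RUST_BACKEND  # flag name assumed from this PR
except ImportError:
    HAS_RUST_BACKEND = False  # keeps the sketch importable outside the library

@pytest.mark.skipif(
    not HAS_RUST_BACKEND,
    reason="pure-Python CI caps bootstrap draws at min_n=49 (tests/conftest.py); "
           "the 0.40 tolerance was calibrated for B in [100, 200], and the 15 "
           "Rust-backed matrix jobs still run this guard at the full B=200",
)
def test_bootstrap_matches_placebo_se():
    ...  # original regression body unchanged
```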

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 23, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: b123c2b763ec0d92584c8e5f1130a92140e3d343


Overall Assessment

✅ Looks good

Static diff review only; numpy, pandas, and pytest are not installed in this workspace, so I could not execute the updated tests.

Executive Summary

Methodology

No unmitigated findings. The changed refit bootstrap path at diff_diff/synthetic_did.py:846 is consistent with the updated methodology registry at docs/methodology/REGISTRY.md:1497, and bootstrap/jackknife inference still routes through safe_inference() at diff_diff/synthetic_did.py:648, so I did not find a new P0/P1 NaN/inference regression.

Code Quality

  • Severity P2. Impact: In _bootstrap_se, fw_nonconvergence_count is incremented before the draw is accepted into bootstrap_estimates, so a draw that later fails the np.isfinite(tau) gate can still count toward the warning numerator while the denominator is n_successful valid draws. That does not change the reported SE, but it can over-trigger or overstate the documented “share of valid bootstrap draws” warning contract. Concrete fix: increment the counter only after the draw passes the finite-tau gate, or keep a separate fw_nonconvergence_valid_count for the warning path. References: diff_diff/synthetic_did.py:982, diff_diff/synthetic_did.py:993, diff_diff/synthetic_did.py:1045, docs/methodology/REGISTRY.md:1508.

Performance

No findings. The refit-bootstrap slowdown is explicit and documented, not an accidental regression. See CHANGELOG.md:15.

Maintainability

No findings.

Tech Debt

  • Severity P3 (tracked). Impact: SDID survey-bootstrap support is intentionally removed and front-door rejected in this release; that is a capability regression, but it is documented and tracked rather than silently wrong. Concrete fix: none for approval; implement the weighted-Frank-Wolfe + Rao-Wu composition follow-up tracked in TODO.md:107 and described in docs/methodology/REGISTRY.md:1551.
  • Severity P3 (tracked). Impact: deleting the old R bootstrap fixture removes the only external cross-language parity anchor for SDID bootstrap math; protection now relies on internal baselines and characterization tests. Concrete fix: none for approval; add the planned Julia/R parity harness tracked in TODO.md:108.

Security

No findings.

Documentation/Tests

@igerber igerber removed the ready-for-ci Triggers CI test workflows label Apr 23, 2026
R10 CI review found two items on top of the previous ✅ Looks good.

P2 Code Quality — aggregate Frank-Wolfe non-convergence warning
numerator/denominator mismatch. In ``_bootstrap_se``,
``fw_nonconvergence_count`` was incremented before the draw cleared
the ``np.isfinite(tau)`` gate. A draw that failed FW convergence AND
then produced non-finite τ would count toward the warning numerator
while the denominator is ``n_successful`` (draws that cleared the
finite-τ gate). That does not affect the reported SE, but it can
overstate the documented "share of valid bootstrap draws" warning
contract and cause the warning to over-trigger.

Fix: move the increment inside the ``if np.isfinite(tau)`` block so
the numerator only counts draws that also contribute to the SE. A
draw failing the finite-τ gate is retried upstream and should not
inflate the non-convergence rate.
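The gate-order fix can be sketched as follows — names are illustrative, but the invariant is the one described above: only draws that clear the finite-τ gate may touch either the warning numerator or the n_successful denominator:

```python
import numpy as np

def record_draw(tau, fw_converged, estimates, counters):
    # Apply the finite-tau gate first; a non-finite draw is retried upstream
    # and contributes to neither counter, so numerator and denominator
    # describe the same population of valid draws.
    if not np.isfinite(tau):
        return False
    estimates.append(tau)
    counters["n_successful"] += 1
    if not fw_converged:
        counters["fw_nonconvergence"] += 1  # increment moved inside the gate
    return True

estimates, counters = [], {"n_successful": 0, "fw_nonconvergence": 0}
for tau, converged in [(1.2, True), (float("nan"), False), (0.9, False)]:
    record_draw(tau, converged, estimates, counters)
print(counters)  # {'n_successful': 2, 'fw_nonconvergence': 1}
```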

P3 Documentation (previously unresolved) — two survey-cross-reference
docs still advertised SyntheticDiD Rao-Wu bootstrap support, which
the estimator now rejects at fit-time with NotImplementedError:

- ``docs/methodology/survey-theory.md:725`` — rewrite the Rao-Wu bullet
  to exclude SDID explicitly, with a pointer to the REGISTRY sketch
  for the deferred weighted-FW composition and to pweight-only
  placebo/jackknife as the available SDID variance alternatives.
- ``docs/tutorials/16_survey_did.ipynb`` cell-35-f1ef376c — update the
  support-matrix table so SDID's row reads
  "pweight only (placebo / jackknife)" with bootstrap struck out, and
  add a "Note on SyntheticDiD" below explaining which methods accept
  pweight-only and why bootstrap rejects all survey designs (weighted-
  FW derivation tracked in TODO.md).

Test coverage unchanged: TestBootstrapSE ran all 7 tests under Rust,
with the 48-of-50 non-convergence warning still firing on the
regression test, confirming the warning numerator still tallies
correctly after the gate-order change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 23, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 5eadcb6dbd1c11aca9a70aeb5a7d54c0ff09296b


Overall Assessment

✅ Looks good

Static diff review only; numpy, pandas, and pytest are not installed in this workspace, so I could not execute the updated tests.

Executive Summary

  • No unmitigated P0/P1 findings in the changed SDID estimator / inference path.
  • The refit bootstrap implementation in diff_diff/synthetic_did.py:L594-L608 and diff_diff/synthetic_did.py:L846-L1065 now re-estimates ω and λ on each bootstrap draw, which aligns with Algorithm 2’s “compute the SDID estimator based on (Y^(b), W^(b))” requirement and with R’s bootstrap_sample() / synthdid_estimate() warm-start contract. (nber.org)
  • The previous P2 on Frank-Wolfe non-convergence accounting appears addressed: fw_nonconvergence_count now increments only after a finite draw is accepted (diff_diff/synthetic_did.py:L986-L998), so numerator and denominator both reference valid draws.
  • The prior survey-doc mismatch also appears resolved in the changed docs (docs/methodology/survey-theory.md:L725-L735, docs/tutorials/16_survey_did.ipynb:L1090, docs/choosing_estimator.rst:L781-L821).
  • The removed external bootstrap parity fixture is not a blocker because the follow-up is explicitly tracked in TODO.md:L108; regression coverage on the changed surfaces is otherwise solid (tests/test_methodology_sdid.py:L500-L710, tests/test_survey_phase5.py:L179-L360, tests/test_business_report.py:L4658-L4702).

Methodology

  • No unmitigated P0/P1 findings.
  • Severity P3 (informational). Impact: variance_method="bootstrap" is now a materially different user-facing method than before; existing fits will get different SE / p-value / CI outputs. This is explicitly documented in CHANGELOG.md:L15-L21 and docs/methodology/REGISTRY.md:L1497-L1552, so it is not a defect. Concrete fix: none for approval. The implementation at diff_diff/synthetic_did.py:L594-L608 and diff_diff/synthetic_did.py:L846-L1065 matches the paper / R contract for recomputing SDID on each bootstrap draw. (nber.org)

Code Quality

  • No findings.

Performance

  • No findings. The slowdown is explicitly documented as an intentional consequence of replacing the removed fixed-weight shortcut with the refit path (CHANGELOG.md:L15-L21, docs/methodology/REGISTRY.md:L1508-L1509).

Maintainability

  • No findings. Extracting _validate_config() and making set_params() transactional improves constructor / setter contract parity (diff_diff/synthetic_did.py:L188-L203, diff_diff/synthetic_did.py:L1479-L1511).
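The shared-validation plus transactional-setter pattern noted in that bullet can be sketched as follows — not the library's actual class, with the enum values and defaults assumed for illustration:

```python
class SDIDConfigSketch:
    # Constructor and sklearn-style setter share one validation gate; a
    # failing set_params rolls back every touched attribute.
    _ALLOWED = {"placebo", "bootstrap", "jackknife"}

    def __init__(self, variance_method="placebo", n_bootstrap=200):
        self.variance_method = variance_method
        self.n_bootstrap = n_bootstrap
        self._validate_config()  # same gate the setter re-runs

    def _validate_config(self):
        if self.variance_method not in self._ALLOWED:
            raise ValueError(f"unknown variance_method {self.variance_method!r}")
        if self.n_bootstrap < 1:
            raise ValueError("n_bootstrap must be >= 1")

    def set_params(self, **params):
        unknown = [k for k in params if not hasattr(self, k)]
        if unknown:
            raise ValueError(f"invalid parameters: {unknown}")
        snapshot = {k: getattr(self, k) for k in params}
        try:
            for k, v in params.items():
                setattr(self, k, v)
            self._validate_config()
        except Exception:
            for k, v in snapshot.items():
                setattr(self, k, v)  # roll back: never leave a half-mutated instance
            raise
        return self
```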

Tech Debt

  • Severity P3 (tracked in TODO.md). Impact: deleting benchmarks/R/generate_sdid_bootstrap_parity_fixture.R and tests/data/sdid_bootstrap_indices_r.json removes the only external cross-language parity anchor for SDID bootstrap in this PR. Concrete fix: none for approval; the replacement parity harness is already tracked in TODO.md:L108.
  • Severity P3 (tracked in TODO.md). Impact: SDID bootstrap no longer supports any survey design, and full-design SDID has no variance path this release. That is a real capability regression, but it is front-door rejected, documented in docs/methodology/REGISTRY.md:L1549-L1551, and tracked in TODO.md:L107, so it is not a blocker. Concrete fix: none for approval; implement the weighted-Frank-Wolfe + Rao-Wu follow-up.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: the new return_convergence=True helper contract is only partially documented; compute_time_weights() and compute_sdid_unit_weights() explain the new parameter but their Returns sections still advertise only np.ndarray, not the tuple return path. Concrete fix: update diff_diff/utils.py:L1462-L1510 and diff_diff/utils.py:L1598-L1643 to mirror _sc_weight_fw’s union-return documentation.
  • Otherwise no findings. The changed tests cover the main regression surfaces: bootstrap refit behavior / non-convergence (tests/test_methodology_sdid.py:L500-L710), survey rejection and pweight-only fallbacks (tests/test_survey_phase5.py:L179-L360), BR inference labeling (tests/test_business_report.py:L4658-L4702), and the coverage artifact schema (tests/test_methodology_sdid.py:L2976-L3005).

R11 CI review flagged that compute_time_weights and
compute_sdid_unit_weights describe the new return_convergence parameter
but their Returns sections still advertise only np.ndarray, not the
tuple return.

Update both docstrings to mirror _sc_weight_fw's union-return shape:
Returns now read "np.ndarray or Tuple[np.ndarray, bool]" with an
explicit description of the two-pass AND convergence flag (True iff
both pre-sparsify and main FW passes hit the min-decrease break; False
if either hit max_iter). Also promote the return_convergence kwarg
description from the implementation-detail placement to the main
Parameters block with the same contract note as _sc_weight_fw so
standalone callers see the legacy-ABI preservation guarantee and the
Rust top-level fast-path skip.

Docs-only change; no code or test changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 23, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: ec34beb8c5f98db72001cfb534ac4d5446e9be60


Overall Assessment

✅ Looks good

Static diff review only; numpy, pandas, and pytest are not installed in this workspace, so I could not execute the updated tests or the Monte Carlo script.

Executive Summary

  • No unmitigated P0/P1 findings in the changed SyntheticDiD estimator / inference path.
  • The refit bootstrap now re-estimates both ω and λ on each pairs-bootstrap draw, and bootstrap/jackknife inference still routes through safe_inference(), which matches the current Methodology Registry contract for Arkhangelsky et al. Algorithm 2 / R-default vcov(method="bootstrap"). diff_diff/synthetic_did.py:L594-L661, diff_diff/synthetic_did.py:L846-L1067, docs/methodology/REGISTRY.md:L1497-L1552
  • Previous re-review concerns appear addressed: Frank-Wolfe non-convergence is now surfaced through explicit Rust/Python convergence flags instead of warning capture, and the tuple-return helper contracts are now documented. diff_diff/synthetic_did.py:L952-L998, diff_diff/utils.py:L1301-L1382, diff_diff/utils.py:L1462-L1667, rust/src/weights.rs:L518-L555
  • Survey support is now internally consistent across code, docs, and tests: full designs are front-door rejected on all SDID variance methods, and pweight-only surveys are limited to placebo/jackknife. This regression is documented and tracked, so it is informational only. diff_diff/synthetic_did.py:L315-L342, docs/choosing_estimator.rst:L785-L823, tests/test_survey_phase5.py:L179-L359, TODO.md:L107-L108
  • Remaining issues are P3 only: one stale fallback message on the power-analysis surface, and weak provenance metadata on the committed coverage artifact.

Methodology

No findings. The changed estimator math, SE formula, retry-to-B contract, survey guards, and bootstrap/jackknife p-value dispatch match the current Registry description. diff_diff/synthetic_did.py:L594-L661, diff_diff/synthetic_did.py:L846-L1067, docs/methodology/REGISTRY.md:L1497-L1552

Code Quality

No findings.

Performance

No findings. The slowdown from fixed-weight bootstrap to per-draw refit is explicit and documented rather than silent. CHANGELOG.md:L15-L21, docs/performance-scenarios.md:L237-L249

Maintainability

No findings. Extracting _validate_config() and making set_params() transactional improves constructor/setter parity. diff_diff/synthetic_did.py:L186-L203, diff_diff/synthetic_did.py:L1479-L1511

Tech Debt

  • Severity P3 (tracked in TODO.md). Impact: SDID no longer has any survey-aware variance path for strata/PSU/FPC, and even pweight-only survey fits cannot use bootstrap in this release. This is front-door rejected and documented, so it is not a blocker. Concrete fix: none for approval; implement the tracked weighted-Frank-Wolfe + Rao-Wu composition follow-up. diff_diff/synthetic_did.py:L315-L342, docs/methodology/REGISTRY.md:L1549-L1551, TODO.md:L107-L107
  • Severity P3 (tracked in TODO.md). Impact: deleting the old R fixture leaves no external cross-language parity anchor for the new refit bootstrap; validation is currently same-library only. Concrete fix: none for approval; add the tracked R/Julia parity harness. TODO.md:L108-L108

Security

No findings.

Documentation/Tests

  • Severity P3. Impact: the new placebo-feasibility error text on the power-analysis surface still recommends variance_method='bootstrap' even on survey-weighted custom-DGP paths, but SyntheticDiD.fit() now rejects bootstrap for every survey design. Concrete fix: branch the fallback guidance in _check_sdid_placebo_data() and the registry-path check so survey-weighted SDID only suggests jackknife (or adding controls). diff_diff/power.py:L714-L721, diff_diff/power.py:L2043-L2051, diff_diff/synthetic_did.py:L334-L342
  • Severity P3. Impact: the committed coverage artifact records "library_version": "3.2.0" even though it documents unreleased refit-bootstrap behavior, which weakens provenance for future methodology audits/regenerations. Concrete fix: emit a git SHA or explicit dev/unreleased marker in the artifact metadata, then regenerate benchmarks/data/sdid_coverage.json. benchmarks/python/coverage_sdid.py:L343-L374, benchmarks/data/sdid_coverage.json:L2-L8
  • Otherwise no findings. The changed tests cover the main regression surfaces: refit-bootstrap behavior/non-convergence, survey rejection, setter rollback, result/report labeling, and coverage-artifact schema. tests/test_methodology_sdid.py:L500-L715, tests/test_methodology_sdid.py:L1234-L1311, tests/test_methodology_sdid.py:L2968-L3010, tests/test_business_report.py:L4658-L4700, tests/test_survey_phase5.py:L179-L359

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 23, 2026
@igerber igerber merged commit 172d1d8 into main Apr 23, 2026
23 of 24 checks passed
@igerber igerber deleted the sdid-bootstrap-refit branch April 23, 2026 23:14
igerber added a commit that referenced this pull request Apr 24, 2026
Foundation for restoring SDID survey-bootstrap support (PR #352, follow-up
to #351 which front-door rejected all survey designs). This commit adds the
weighted-FW kernel + Python wrappers; the bootstrap integration lands in
the next commit.

Rust (rust/src/weights.rs, rust/src/lib.rs):
- New `sc_weight_fw_gram_weighted` and `sc_weight_fw_standard_weighted`
  loop variants. Identical to the unweighted loops except for the
  regularization term: `half_grad[j]` picks up `eta*reg_w[j]*lam[j]` in
  place of `eta*lam[j]`, and the FW step-size denominator uses the
  diag(reg_w)-weighted simplex direction norm
  `Σ_j reg_w[j]*d[j]²` (which simplifies to
  `Σ_j reg_w[j]*lam[j]² + reg_w[i] - 2*reg_w[i]*lam[i]` for d = e_i - lam).
- New `sc_weight_fw_weighted_internal` dispatcher that delegates to the
  unweighted internal when reg_weights is None (preserves the legacy
  numeric contract for any future caller that wants the generic shape).
- Two new pyfunctions: `sc_weight_fw_weighted` and
  `sc_weight_fw_weighted_with_convergence`. Same call shape as the
  existing unweighted siblings plus a trailing `reg_weights` kwarg.
  Registered in lib.rs.
- 3 new Rust unit tests in rust/src/weights.rs:
    * test_weighted_fw_reg_weights_none_delegates — bit-identity at
      rel=1e-14 against the unweighted internal.
    * test_weighted_fw_uniform_reg_weights_matches_unweighted — uniform
      rw=1 collapses to uniform regularization (rel=1e-12, allowing for
      ULP-scale drift from different float reduction orders).
    * test_weighted_fw_simplex_invariants — for arbitrary positive rw
      and both gram (T0<N) and standard (T0>=N) paths, returned ω sums to
      1 and is non-negative.
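The closed form quoted above for the diag(reg_w)-weighted simplex-direction norm can be checked numerically: for d = e_i - lam, expanding Σ_j rw_j·d_j² gives Σ_j rw_j·lam_j² + rw_i - 2·rw_i·lam_i, since e_i contributes rw_i once and the cross term is -2·rw_i·lam_i. A standalone verification sketch:

```python
import numpy as np

# Verify: sum_j rw[j]*d[j]^2 == sum_j rw[j]*lam[j]^2 + rw[i] - 2*rw[i]*lam[i]
# for every simplex vertex direction d = e_i - lam.
rng = np.random.default_rng(42)
n = 7
rw = rng.uniform(0.1, 3.0, size=n)       # arbitrary positive reg weights
lam = rng.dirichlet(np.ones(n))          # a point on the probability simplex
for i in range(n):
    d = -lam.copy()
    d[i] += 1.0                          # d = e_i - lam
    lhs = np.sum(rw * d**2)
    rhs = np.sum(rw * lam**2) + rw[i] - 2.0 * rw[i] * lam[i]
    assert np.isclose(lhs, rhs, rtol=1e-12)
print("weighted step-size identity verified at all simplex vertices")
```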

Python (diff_diff/utils.py, diff_diff/_backend.py):
- Export _rust_sc_weight_fw_weighted and _with_convergence from _backend
  (mirrors the shape added for _rust_sc_weight_fw_with_convergence in
  PR #351 c0d089b).
- Extend `_sc_weight_fw` and `_sc_weight_fw_numpy` with a
  `reg_weights: Optional[np.ndarray] = None` kwarg. When set on the Rust
  path, dispatches to the new weighted pyfunctions; on the pure-Python
  path, runs a weighted FW loop mirroring the Rust derivation.
- New helper `compute_sdid_unit_weights_survey(Y_pre_control,
  Y_pre_treated_mean, rw_control, ...)`: column-scales Y_pre_control by
  rw_control and passes rw_control as reg_weights so the FW solves the
  unit-weight survey-bootstrap objective
    min_{ω simplex} Σ_t (Σ_i rw_i·ω_i·Y_i,pre[t] - treated_pre[t])² +
                    ζ²·Σ_i rw_i·ω_i²
  Two-pass sparsify-refit structure mirrors compute_sdid_unit_weights.
  Returns ω on the standard simplex (caller composes ω_eff downstream).
- New helper `compute_time_weights_survey(Y_pre_control, Y_post_control,
  rw_control, ...)`: row-scales Y_time by sqrt(rw_control) and passes
  no reg_weights (uniform reg on λ — λ is per-period, rw is per-control,
  no alignment for per-λ weighting). Two-pass structure unchanged.
- Both new helpers expose `return_convergence=True` returning the AND of
  the two pass convergence flags, mirroring the contract added in
  PR #351 c0d089b.
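The survey-bootstrap unit-weight objective quoted above can be evaluated directly in numpy. A minimal sketch — the function name and shapes are illustrative, not the library's kernel; `Y_pre_control` is taken here as `(T0, N)` with one column per control unit, matching the column-scaling described above:

```python
import numpy as np

def survey_unit_objective(omega, Y_pre_control, treated_pre, rw, zeta):
    """Evaluate the survey unit-weight objective for a candidate omega
    on the simplex (illustrative sketch, not the library's FW kernel):

        loss(omega) = sum_t (sum_i rw_i * omega_i * Y[t, i]
                             - treated_pre[t])**2
                      + zeta**2 * sum_i rw_i * omega_i**2
    """
    fitted = Y_pre_control @ (rw * omega)          # (T0,) weighted fit
    resid = fitted - treated_pre                   # (T0,) residuals
    reg = zeta**2 * np.sum(rw * omega**2)          # rw-weighted ridge term
    return float(resid @ resid + reg)
```

With `rw` all ones this collapses to the standard regularized SC objective, which is the uniform-rw equivalence the tests above check.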

Tests (tests/test_weighted_fw.py — new, 15 tests):
- _sc_weight_fw weighted-reg path: reg_weights=None matches unweighted at
  bit-identity; uniform reg matches unweighted at rel=1e-12;
  Rust/numpy parity at rel=1e-9; simplex invariants under arbitrary rw;
  return_convergence tuple shape.
- compute_sdid_unit_weights_survey: uniform-rw equivalence to unweighted
  helper, simplex invariants under arbitrary rw, shape-mismatch raises,
  return_convergence AND.
- compute_time_weights_survey: same coverage matrix, plus a zero-rw
  subset test (Rao-Wu-style undrawn PSU yields valid simplex λ).
- Backend parity: pure-Python vs Rust weighted-helper output at rel=1e-7
  for both unit and time helpers (monkeypatches HAS_RUST_BACKEND).

ABI preservation: existing unweighted callers of _sc_weight_fw,
compute_sdid_unit_weights, compute_time_weights are unaffected — the new
kwarg defaults to None and dispatches to the legacy code path. The
bit-identity check on
TestScaleEquivariance::test_baseline_parity_small_scale[bootstrap] still
passes at rel=1e-14 (verified in the next commit, when the bootstrap
integration lands).
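The ABI-preservation contract above — a trailing kwarg defaulting to None that routes to the legacy code path untouched — reduces to a small dispatch pattern. A sketch with stand-in kernels (all names here are illustrative, not the library's internals):

```python
import numpy as np

def _fw_unweighted(Y, treated):
    # Stand-in for the legacy unweighted FW kernel.
    n = Y.shape[1]
    return np.full(n, 1.0 / n)

def _fw_weighted(Y, treated, reg_weights):
    # Stand-in for the new weighted FW kernel.
    return reg_weights / reg_weights.sum()

def sc_weight_fw(Y, treated, reg_weights=None):
    """Dispatcher sketch: reg_weights=None preserves the legacy numeric
    contract by delegating to the unweighted kernel unchanged; only an
    explicit array takes the new weighted path."""
    if reg_weights is None:
        return _fw_unweighted(Y, treated)
    return _fw_weighted(Y, treated, np.asarray(reg_weights, dtype=float))
```

Existing callers never pass the kwarg, so they cannot observe the new path — which is why the parity test above can stay at rel=1e-14.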

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
…sition

PR #352 restores the SDID survey-bootstrap capability that PR #351
rejected up front as a known regression. Pweight-only and full-design
surveys now both succeed; placebo / jackknife continue to reject full
designs (a separate methodology gap tracked in TODO.md).

`diff_diff/synthetic_did.py::fit` (guards):
- Replace the unconditional strata/PSU/FPC NotImpl guard with a method-
  gated version that fires only for placebo / jackknife. Rationale +
  truth-table in REGISTRY.md §SyntheticDiD survey-support matrix:

      method     pweight-only      strata/PSU/FPC
      bootstrap  ✓ (this PR)       ✓ Rao-Wu (this PR)
      placebo    ✓ unchanged       ✗ NotImpl (separate gap)
      jackknife  ✓ unchanged       ✗ NotImpl (separate gap)

- Delete the unconditional `bootstrap + any-survey` guard added in #351.
  Keep the `weight_type != "pweight"` validation (fweight/aweight still
  rejected).

`diff_diff/synthetic_did.py::fit` (survey resolution):
- After validating the per-unit survey weights (`w_treated`, `w_control`),
  also collapse the observation-level `resolved_survey` to a unit-level
  view via `collapse_survey_to_unit_level(...)` ordered as
  `[*control_units, *treated_units]`. The resulting
  `resolved_survey_unit` is what `_bootstrap_se` slices via
  `boot_rw[:n_control]` / `boot_rw[n_control:]` per Rao-Wu draw.

`diff_diff/synthetic_did.py::fit` (dispatcher):
- Branch the bootstrap call on whether the design is pweight-only or
  full design (strata/PSU/FPC). Pass `w_control`/`w_treated` for
  pweight-only, `resolved_survey=resolved_survey_unit` for full design,
  None/None for non-survey.

`diff_diff/synthetic_did.py::_bootstrap_se`:
- New kwargs: `w_control`, `w_treated`, `resolved_survey` (all keyword-
  only, default None — preserves the legacy signature).
- Single-PSU short-circuit: unstratified survey with <2 PSUs returns
  (NaN, []) since the bootstrap distribution is unidentified
  (resampling one PSU yields the same subset every draw). Recovered from
  the pre-PR-#351 fixed-weight Rao-Wu branch (commit 91082e5).
- Per-draw Rao-Wu rescaling for full designs:
  ``rw = generate_rao_wu_weights(resolved_survey, rng)`` sliced over the
  resampled units. Pweight-only path uses ``rw = w_control[boot_idx]``
  (constant per draw, no rescaling).
- Survey-weighted treated-unit means: ``np.average(..., weights=rw_treated_draw)``
  when survey weights are present.
- Warm-start: the simplex init scales by rw before sum_normalize when on
  the survey path, matching the per-draw weighted-FW geometry.
- Per-draw FW dispatch: survey paths call the new
  ``compute_sdid_unit_weights_survey`` / ``compute_time_weights_survey``
  helpers (PR #352 commit 1) which run the weighted-FW kernel; non-
  survey paths continue to call the unweighted helpers (bit-identity
  preserved on the non-survey refit path).
- Post-FW composition: ``ω_eff = rw·ω / Σ(rw·ω)`` for the SDID
  estimator (which expects simplex weights). Degenerate-retry if
  ``Σ(rw·ω) <= 0`` (all mass on rw=0 controls).
- Aggregate FW non-convergence warning: tally is the AND of the two
  helpers' convergence flags per draw, fires above 5% (PR #351 c0d089b
  shape preserved, no copy change).
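The post-FW composition and degenerate-retry check described above can be sketched as follows (hypothetical helper name; the real draw loop redraws on a degenerate result, so the sketch just signals the caller with None):

```python
import numpy as np

def compose_omega_eff(omega, rw):
    """Compose effective simplex weights from the FW output omega and the
    per-draw weights rw:  omega_eff = rw * omega / sum(rw * omega).

    Returns None when all omega mass falls on rw == 0 controls
    (sum(rw * omega) <= 0), so the caller can redraw — the
    degenerate-retry case described above."""
    raw = rw * omega
    total = raw.sum()
    if total <= 0:
        return None  # degenerate draw: every selected control has rw = 0
    return raw / total
```

Because `omega_eff` is renormalized onto the simplex, the downstream SDID estimator (which expects simplex weights) needs no survey-specific changes.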

Tests:
- ``tests/test_survey_phase5.py``: rewrite three PR #351 raises-tests as
  succeeds-tests with explicit SE assertions —
    * ``test_full_design_bootstrap_succeeds`` (was ``_raises``):
      finite SE, populated survey_metadata.n_strata/n_psu, summary()
      includes Survey Design + Bootstrap replications blocks.
    * ``test_bootstrap_with_pweight_only_succeeds`` (was ``_raises``):
      finite SE, variance_method preserved (cross-surface guard).
    * New ``test_bootstrap_full_design_se_differs_from_pweight_only``
      resurrects the PR #351 R3-deleted differs-from contract: ATT
      matches between paths (both compose ω_eff post-fit) but SE
      differs (Rao-Wu adds PSU clustering variance).
- ``tests/test_methodology_sdid.py::TestBootstrapSE``: rewrite two PR #351
  raises-tests as succeeds-tests, plus add the
  ``test_bootstrap_single_psu_returns_nan`` short-circuit regression.

Verified: 308 tests pass across test_methodology_sdid /
test_business_report SDID subset / test_rust_backend / test_survey_phase5
/ test_weighted_fw / test_guides.

Bit-identity check: the non-survey refit path goes through the
unweighted helpers (no weighted-FW dispatch), so
``TestScaleEquivariance::test_baseline_parity_small_scale[bootstrap]``
remains at rel=1e-14 — verified passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
Capstone of PR #352. Validates the new weighted-FW + Rao-Wu bootstrap
composition and propagates the landed capability across the
documentation surfaces.

Coverage MC harness (benchmarks/python/coverage_sdid.py):
- Add ``stratified_survey`` as a 4th DGP in ``ALL_DGPS``. Uses
  ``generate_survey_did_data`` to produce an N=40 (strata=2, PSU=2/
  stratum) null-treatment panel with moderate weight variation and
  modest ICC (``psu_re_sd=1.5``). Cohort 7 → post = 7..11 (5 post
  periods). Converts per-observation ``treated`` to a unit-level
  ever-treated indicator (SDID's block-treatment requirement).
- Extend ``DGPSpec`` with an optional ``survey_design_factory``
  callable that returns ``(SurveyDesign, supported_methods_tuple)``.
  For ``stratified_survey``: bootstrap only — placebo / jackknife
  reject strata/PSU/FPC at fit-time, so the harness skips them
  rather than catching the NotImplementedError inside ``_fit_one``.
- ``_fit_one`` gains an optional ``survey_design`` kwarg routed
  through ``SyntheticDiD.fit(survey_design=)``. ``_run_dgp`` calls
  the factory once per seed (DataFrame contents don't affect
  columns) and gates methods on the supported set.
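The factory-plus-gating contract above can be sketched as a small dataclass and filter. Field names beyond `survey_design_factory`, and the method names, are assumptions for illustration — only the `(SurveyDesign, supported_methods_tuple)` return shape comes from the description above:

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class DGPSpec:
    # Illustrative subset of the harness spec.
    name: str
    survey_design_factory: Optional[
        Callable[[], Tuple[object, Tuple[str, ...]]]
    ] = None

def methods_for(spec, all_methods=("bootstrap", "placebo", "jackknife")):
    """Gate estimation methods on the DGP's survey support: a full design
    (strata/PSU/FPC) is bootstrap-only, so the harness skips unsupported
    methods up front rather than catching NotImplementedError in the fit.
    Returns (supported methods, survey design or None)."""
    if spec.survey_design_factory is None:
        return all_methods, None
    design, supported = spec.survey_design_factory()
    return tuple(m for m in all_methods if m in supported), design
```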

Regenerated ``benchmarks/data/sdid_coverage.json`` via
``python benchmarks/python/coverage_sdid.py --n-seeds 500 --n-bootstrap
200``. Total wall-clock 2421 s (~40 min on M-series Mac, Rust backend);
aer63 remains the long tail at 2237 s, stratified_survey adds only
33 s.

Calibration gate (plan §2.7): ``stratified_survey × bootstrap`` at
α=0.05 returns 0.042 (500 seeds × B=200), inside the calibration
band [0.02, 0.10]. ``mean SE / true SD = 1.25`` indicates the
bootstrap is conservative (it overestimates the empirical sampling
SD by ~25%) — the safer direction under Rao-Wu rescaling with only
4 PSUs total. This validates the weighted-FW + Rao-Wu composition
end-to-end.
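The two diagnostics quoted above reduce to a few lines over the per-seed `(ATT, SE)` results. An illustrative sketch (not the harness's own code) under H0, where the true ATT is 0 and a normal test is assumed:

```python
import numpy as np

def calibration_diagnostics(atts, ses, z_crit=1.959964):
    """Coverage-MC diagnostics under H0 (true ATT = 0):
    - rejection rate: share of seeds with |ATT / SE| > z_crit
      (z_crit ~ 1.96 corresponds to alpha = 0.05 for a normal test)
    - mean SE / true SD: mean estimated SE over the empirical SD of the
      ATT draws; > 1 means the SE is conservative, < 1 anti-conservative.
    """
    atts = np.asarray(atts, dtype=float)
    ses = np.asarray(ses, dtype=float)
    rejection = float(np.mean(np.abs(atts / ses) > z_crit))
    se_ratio = float(np.mean(ses) / np.std(atts, ddof=1))
    return rejection, se_ratio
```

On this reading, the 0.042 rejection and the 1.25 ratio above are consistent: an SE that runs high mechanically pulls the rejection rate below nominal.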

REGISTRY.md §SyntheticDiD:
- Add ``stratified_survey`` row to the coverage MC table and a
  paragraph under it documenting the calibration verdict, the
  conservatism direction, and why placebo/jackknife rows are NaN.
- Replace the survey-support bullet with a truth-table matrix (PR
  #352 shape); add a ``Note (survey + bootstrap composition)``
  documenting the weighted-FW objective (unit and time forms), the
  ω_eff composition, the argmin-set caveat, the per-draw rw
  dispatch (pweight-only vs Rao-Wu), and the single-PSU
  short-circuit.
- Update the ``Note (default variance_method deviation from R)`` to
  drop the "bootstrap rejects surveys" framing (no longer accurate).
- Update the ``Note (coverage Monte Carlo calibration)`` header to
  say "4 representative null-panel DGPs" and flag stratified_survey
  as bootstrap-only.

User-facing docs:
- ``docs/methodology/survey-theory.md``: restore SDID in the Rao-Wu
  Rescaled Bootstrap list; describe the weighted-FW composition.
- ``docs/survey-roadmap.md``: Phase 5 SDID row updated to reflect
  full-design bootstrap support via PR #352; Phase 6 Rao-Wu bullet
  restores SDID.
- ``docs/tutorials/16_survey_did.ipynb`` cell-35: support matrix
  table row for SyntheticDiD switches from "pweight only (placebo/
  jackknife)" to "bootstrap only (PR #352) for strata/PSU/FPC";
  "Note on SyntheticDiD" block rewritten for the landed contract.
- ``diff_diff/synthetic_did.py`` ``__init__`` docstring: bootstrap
  bullet now describes survey support and the ω_eff composition.
- ``diff_diff/guides/llms-full.txt``: survey-aware bootstrap bullet
  includes SDID in the Rao-Wu list with the weighted-FW formula.

CHANGELOG.md:
- Retain the PR #351 regression Changed entry but annotate it as
  "restored in PR #352"; add new Added/Changed PR #352 entries
  documenting the weighted-FW kernel, survey helpers, _bootstrap_se
  Rao-Wu composition, and the new coverage MC row.

TODO.md:
- Row 103 (SDID + survey designs) → closed by PR #352; replaced
  with a narrower follow-up for placebo/jackknife + strata/PSU/FPC
  (Low priority, no concrete sketch yet).

Tests:
- ``TestCoverageMCArtifact`` extended: 4 DGPs asserted (including
  ``stratified_survey``); new explicit assertions that the
  stratified_survey bootstrap row has ≥100 successful fits and
  α=0.05 rejection ∈ [0.02, 0.10]; placebo/jackknife rows
  n_successful_fits == 0 (strata/PSU/FPC rejection contract).

Verified: TestCoverageMCArtifact passes against the regenerated
artifact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>